Abstract: We consider the problem of universally communicating over an unknown and arbitrarily varying channel, using
feedback. The focus of this paper is on determining the input 
behavior, and specifically, a prior distribution which is used 
to randomly generate the codebook. We pose the problem of 
setting the prior as a sequential universal prediction problem, 
that attempts to approach a given target rate, which depends 
on the unknown channel sequence. The main result is that, 
for a channel comprised of an unknown, arbitrary sequence 
of memoryless channels, there is a system using feedback and 
common randomness that asymptotically attains, with high prob- 
ability, the capacity of the time-averaged channel, universally for 
every sequence of channels. While no prior knowledge of the 
channel sequence is assumed, the rate achieved meets or exceeds 
the traditional arbitrarily varying channel (AVC) capacity for 
every memoryless AVC defined over the same alphabets, and 
therefore the system universally attains the random code AVC 
capacity, without knowledge of the AVC parameters. The system 
we present combines rateless coding with a universal prediction 
scheme for the prior. We present rough upper bounds on the 
rates that can be achieved in this setting and lower bounds for 
the redundancies. 

I. Introduction 

We consider the problem of communicating over an un- 
known and arbitrarily varying channel, with the help of 
feedback. We would like to minimize the assumptions on 
the communication channel as much as possible, while using 
the feedback link to learn the channel. The main questions 
with respect to such channels are how to define the expected 
communication rates, and how to attain them universally, 
without channel knowledge. 

The traditional models for unknown channels [1] are compound
channels, in which the channel law is selected arbitrarily
out of a family of known channels, and arbitrarily varying
channels (AVCs), in which a sequence of channel states is
selected arbitrarily. The well known results for these models
[1] do not assume adaptation. Therefore, the AVC capacity,
which is the supremum of the communication rates that can 
be obtained with vanishing error probability over any possible 
occurrence of the channel state sequence, is in essence a 
worst-case result. For example, if one assumes that $y_i$, the
channel output at time $i$, is determined by the probability law
$W_i(y_i|x_i)$, where $x_i$ is the channel input and $W_i$ is an arbitrary
sequence of conditional distributions, clearly no positive rate
can be guaranteed a priori, as it may happen that all $W_i$ have
zero capacity, and therefore the AVC capacity is zero. This
capacity may be non-zero only if a constraint on $W_i$ is defined.
In this paper we use the term "arbitrarily varying channel" in 



a loose manner, to describe any kind of unknown and arbitrary 
change of the channel over time, and the acronym "AVC" to 
refer to the traditional model [1].

Other communication models, which allow positive commu- 
nication rates over such AVC's were proposed by the authors 
and others [2], [3], [4], [5]. Although the channel models
considered in these papers are different, the common feature 
distinguishing them from the traditional AVC setting is that 
the communication rate is adaptively modified using feedback. 
The target rate is known only a-posteriori, and is gradually 
learned throughout the communication process. By adapting 
the rate, one avoids worst case assumptions on the channel, 
and can achieve positive communication rates when the chan- 
nel is good. However, in the aforementioned communication 
models, the distribution of the transmitted signal is fixed and 
independent of the feedback, and only the rate is adapted. 
Specifically, in the "individual channel" model [4], for reasons
explained therein, the distribution of the channel input is fixed
to a predefined prior. Likewise, Eswaran et al. [3] show that for
a fixed prior, the mutual information of the averaged channel 
can be attained. Clearly, with this limitation these systems 
are incapable of universally attaining the channel capacity in 
many cases of interest. For example, consider even the simple 
case where the channel is a compound memoryless channel, 
i.e. the conditional distributions Wi = W are all constant but 
unknown. 

In the last paper [5], the problem of universal communication
was formulated as that of a competition against a
reference system, comprised of an encoder and a decoder with 
limited capabilities. For the case where the channel is modulo- 
additive with an individual, arbitrary noise sequence, it was 
shown possible to asymptotically perform at least as well 
as any finite-block system (which may be designed knowing 
the noise sequence), without prior knowledge of the noise 
sequence. However, this result crucially relies on the property 
of the modulo-additive channel, that the capacity achieving 
prior is the uniform i.i.d. prior for any noise distribution. To 
extend the result to more general models, we would like to 
be able to adapt the input behavior. The key parameter to be
adapted is the "prior", i.e. the distribution of the codebook 
(or equivalently the channel input), since it plays a vital role 
in the converse as well as the attainability proof of channel 
capacity and is the main factor in adapting the message to the 
channel [16].

In a crude way we may say that previous works achieve 
various kinds of "mutual information" for a fixed prior and 
any channel from a wide class, by mainly solving problems
of universal decoding and rate adaptation. However, to obtain
more than the "mutual information", i.e. the "capacity", one 
would need to select the prior in a universal way. 

Prior adaptation using feedback is a well known practice 
for static or semi-static channels. Two familiar examples are 
bit and power loading performed in Digital Subscriber Lines 
(DSLs) [7], and precoding in multi-antenna systems [8],
which is performed in practice in wireless standards such as 
WiFi, WiMAX and LTE. If the channel can be assumed to be 
static for a period of time sufficient to close a loop of channel 
measurement, feedback and coding, then an input prior close 
to the optimal one can be chosen. In the theoretical setting 
of the compound memoryless channel where $\Pr(Y_i|X_i) = W(Y_i|X_i)$,
where $W$ is unknown but fixed, a system with
feedback can asymptotically attain the channel capacity of
$W$, without prior knowledge of it, by using an asymptotically
small portion of the transmission time to estimate the channel,
and using an estimate of the optimal prior and the suitable rate
during the rest of the time [9]. All models for prior adaptation
that we are aware of, use the assumption that the knowledge 
of the channel at a given time yields non trivial statistical 
information about future channel states, but do not deal with 
arbitrary variation. 

The question that we deal with in this paper is: assuming a 
channel which is arbitrarily changing over time, is there any 
merit in using feedback to adapt the input distribution, and 
what rates can be guaranteed? As a target, we would have 
liked to consider the most general variation of the channel 
(as in the unknown vector channel model [5]); however, to
start our exploration, we focus on channel models which are 
memoryless in the input, i.e. whose behavior at a certain time 
does not depend on any previous channel inputs. The most 
general model that does not include memory of the input is 
that of an unknown sequence of memoryless channels (which 
is in essence an AVC without constraints) and this is the main 
model considered in this paper. The motivation for avoiding 
memory of the input can be appreciated by considering the 
negative examples in [5].

We now give a brief overview of the structure and the results
of this paper. In Section II we state the problem, and define
several communication rates (as a function of the channel
sequence) that would be of interest. In order to focus thoughts
on questions related to the problem of determining the prior,
we initially adopt an abstract model of the communication
system, stripping off the details of communication, such as
decoding, channel estimation, overheads, error probability, etc.
We begin by presenting an easier synthetic problem, in which
all previous channels are known (Section III). This problem
may represent a channel which changes its behavior in a block-wise
manner and remains i.i.d. memoryless during each block
(a subset of the original problem). This problem is related
to standard prediction problems (Section III-B), and is used as
a tool to gain insight into the prediction problem involved,
present bounds on what can be achieved universally, and
develop the techniques that will be used later on. Furthermore,
we show that even for this easier problem there is no hope
to attain the channel capacity universally and we would
have to settle for lower rates (Section III-C). The attained
rate is the maximum over the prior of the averaged mutual
information (Theorem 1). In Section IV we return to the
main problem, and show that the rate that can be attained
when the past channel is not known, but is estimated from
the output, is lower. We focus on the capacity of the time-averaged
channel. We show this rate is the best achievable rate
that does not depend on the order of the channel sequence
(Theorem 2), and present the main result showing that this
rate is indeed achievable (Theorem 3). Furthermore, this rate
meets or exceeds the AVC capacity, and essentially equals
the "empirical capacity" defined by Eswaran et al. [3]. We
present a scheme that combines rateless coding with a
prior predictor and attains this rate. In Section IV-C the prior
predictor is developed under abstract assumptions regarding
the channel estimation and decoding rate. In Section V we
present and analyze the full communication system and prove
the main result. Finally, Section VI is devoted to discussion
and comments.

II. Notation and problem statement 
A. Notation 

We denote random variables by capital letters and vectors 
by boldface. However for probabilities which are sometimes 
treated as vectors we use regular capital letters. We apply 
superscript and subscript indices to vectors to define subsequences
in the standard way, i.e. $\mathbf{x}_i^j \triangleq (x_i, x_{i+1}, \ldots, x_j)$
and $\mathbf{x}^i \triangleq \mathbf{x}_1^i$.

$I(Q, W)$ denotes the mutual information obtained when
using a prior $Q$ over a channel $W$, i.e. it is the mutual information
$I(Q, W) = I(X; Y)$ between two random variables with
the joint probability $\Pr(X, Y) = Q(X) \cdot W(Y|X)$. $C(W)$
denotes the channel capacity $C(W) = \max_Q I(Q, W)$. For
discrete channels, the channel $W(y|x)$ is sometimes presented
as a matrix where $W(y|x)$ is in the $x$-th column and the $y$-th
row. Logarithms and all information quantities are base 2
unless specified otherwise.

We denote by $\Delta_{\mathcal{X}}$ the unit simplex
$\Delta_{\mathcal{X}} = \{Q : \sum_{x \in \mathcal{X}} Q(x) = 1\}$, i.e. the set of all
probability measures on $\mathcal{X}$.

$\mathrm{Ber}(p)$ denotes a Bernoulli random variable with probability
$p$ to be 1. $\mathrm{Ind}(\cdot)$ denotes an indicator function of an event or a
condition, and equals 1 if the event occurs and 0 otherwise. We
use "$\ldots$" to denote simple mathematical inductions, where the
same rule is repeatedly applied, for example
$a_n \leq n \cdot a_{n-1} \leq \ldots \leq n! \cdot a_0$.

A hat ($\hat{\square}$) denotes an estimated value, and a line ($\overline{\square}$) denotes
an average value. The empirical distribution of a vector x of 
length n is a function representing the relative frequency of 
each letter. 
where the subscript identifies the vector. The conditional
empirical distribution of two equal length vectors x, y is 
defined as 
B. Problem setting

Let $\mathcal{X}, \mathcal{Y}$ be sets defining the input and output alphabets,
respectively. Both $\mathcal{X}, \mathcal{Y}$ are assumed to be finite, unless stated
otherwise. Let $W_1^n$ be a sequence of memoryless channels
over $n$ channel uses. Each $W_i$ is a conditional distribution
$W_i(y|x)$ where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ represent an input and
output symbol respectively. The conditional distribution of the
output vector $\mathbf{Y}$ given the input vector $\mathbf{X}$ is given by:

$$\Pr(\mathbf{Y}|\mathbf{X}) = \prod_{i=1}^{n} W_i(Y_i|X_i). \qquad (3)$$



The sequence of channels $W_i$ is arbitrary and unknown to
the transmitter and the receiver. We assume the existence of
common randomness (i.e. that the transmitter and the receiver
both have access to some random variable of choice). There
exists a feedback link between the receiver and the transmitter.
To simplify, we assume the feedback is completely reliable,
has unlimited bandwidth and is instantaneous, i.e. arrives at
the encoder before the next symbol. We assume the system
is rate adaptive, which means that the message is represented
by an infinite bit sequence $\mathbf{m}_1^\infty$, and the system may choose
how many bits to send. The error probability is measured only
over the bits which were actually sent (i.e. over the first $\lceil nR \rceil$
bits, where $R$ is the rate reported by the receiver). The system
setup is presented in Figure 1.

To simplify, we assume that there are no constraints on the 
channel input (such as power constraints). If such constraints 
exist they can be accommodated by changing the set of 
potential priors. 

Since the channel sequence is arbitrary there is no positive
rate which can be guaranteed a priori. Instead, we define a
target rate $R(W_1^n)$ as a function of the channel sequence $W_1^n$.

Definition 1. We say that a sequence of rate functions $R(W_1^n)$
is asymptotically attainable if for every $\epsilon, \delta, \Delta > 0$ there is $n$
large enough such that there is a system with feedback and
common randomness over $n$ channel uses, in which, for every
sequence $W_1^n$, the rate is $R(W_1^n) - \Delta$ or more with
probability of at least $1 - \delta$, while the probability of error is
at most $\epsilon$.

In the next section we propose several potential target rates
and then ask which of these are attainable.



C. Potential target rates 

With respect to the sequence $\{W_i\}$ we can define various
meaningful information theoretic measures. The maximum
possible rate of reliable communication is the capacity when
the sequence is known a priori (in other words, the capacity
with full, non-causal, channel state information at the transmitter
and the receiver) and is given by:

$$C_1(W_1^n) = \max_{\{Q_i\}} \frac{1}{n} \sum_{i=1}^{n} I(Q_i, W_i) = \frac{1}{n} \sum_{i=1}^{n} \max_{Q} I(Q, W_i) = \frac{1}{n} \sum_{i=1}^{n} C(W_i). \qquad (4)$$

Note that if constraints on the sequence $\{Q_i\}$ existed, then we
would have an inequality [10]. The maximum rate that can be
obtained with a single fixed prior when the sequence is known
is:

$$C_2(W_1^n) = \max_{Q} \frac{1}{n} \sum_{i=1}^{n} I(Q, W_i). \qquad (5)$$

Lastly, the capacity of the time-averaged channel is:

$$C_3(W_1^n) = \max_{Q} I\left(Q, \frac{1}{n} \sum_{i=1}^{n} W_i\right) = C(\overline{W}), \qquad (6)$$

where we define the time-averaged channel as

$$\overline{W}(y|x) \triangleq \frac{1}{n} \sum_{i=1}^{n} W_i(y|x). \qquad (7)$$

Footnote (on the finiteness of the alphabets): Note that the results in Section ... do not require $\mathcal{Y}$ to be finite.

Footnote (on the feedback assumptions): The asymptotical results hold also when feedback is band limited and
delayed.



Clearly, $C_1 \geq C_2 \geq C_3$, where the first inequality results
from the order of maximization and the second results from
the convexity of the mutual information with respect to the
channel. For each of the above target rates we would like to
find out whether it is achievable under the definitions above.
As we shall see, $C_1$ is not achievable, $C_3$ is achievable, and
$C_2$ is achievable only under further constraints imposed on the
problem.
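The ordering of the three target rates can be checked numerically. The following is a minimal sketch and not part of the original development: it assumes binary-input channels, so that the maximization over the prior can be approximated by a grid search, and the helper names (`mutual_information`, `average_channel`, `target_rates`) and the two example channels are illustrative choices.

```python
import math

def mutual_information(Q, W):
    """I(Q, W) in bits, with W[y][x] = W(y|x) and Q[x] the prior."""
    ny, nx = len(W), len(Q)
    Py = [sum(Q[x] * W[y][x] for x in range(nx)) for y in range(ny)]
    total = 0.0
    for x in range(nx):
        for y in range(ny):
            p = Q[x] * W[y][x]
            if p > 0.0:
                total += p * math.log2(W[y][x] / Py[y])
    return total

def average_channel(channels):
    """Time-averaged channel: entry-wise mean of the channel matrices."""
    n = len(channels)
    ny, nx = len(channels[0]), len(channels[0][0])
    return [[sum(W[y][x] for W in channels) / n for x in range(nx)]
            for y in range(ny)]

def target_rates(channels, grid=1000):
    """Grid-search approximations of C1, C2, C3 for binary-input channels."""
    priors = [[1.0 - k / grid, k / grid] for k in range(grid + 1)]
    n = len(channels)
    # C1: the prior is optimized separately for each channel.
    c1 = sum(max(mutual_information(Q, W) for Q in priors)
             for W in channels) / n
    # C2: one fixed prior, averaged mutual information.
    c2 = max(sum(mutual_information(Q, W) for W in channels) / n
             for Q in priors)
    # C3: capacity of the time-averaged channel.
    w_bar = average_channel(channels)
    c3 = max(mutual_information(Q, w_bar) for Q in priors)
    return c1, c2, c3

# Illustrative channels (assumptions): a BSC with crossover 0.1 and a Z-channel.
bsc = [[0.9, 0.1], [0.1, 0.9]]
z_channel = [[1.0, 0.3], [0.0, 0.7]]
c1, c2, c3 = target_rates([bsc, z_channel, z_channel])
print(c1, c2, c3)
```

On the grid, the first inequality holds because each per-channel maximum dominates the value at any common prior, and the second holds prior by prior through the convexity of the mutual information in the channel.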

A rigorous proof that $C_1$ is the capacity of the channel
sequence is left out of the scope of this paper. For our purpose,
it is sufficient to observe that $C_1$ is an upper bound on
the achievable rate, because the mutual information between
channel input and output is maximized by a memoryless (not
i.i.d.) input distribution $\prod_{i=1}^{n} Q_i(x_i)$. To see intuitively how
$C_1$ can be achieved, consider that since $n$ can be arbitrarily
large while the input and output alphabets, and thus the set
of channels, remain constant, we may sort the channels into
groups of similar channels, and apply block coding to each
group. A close result pertaining to stationary ergodic channels
appears in [11, (3.3.5)].

III. A synthetic "toy" problem

In this section we present a synthetic problem, which will 
help us examine the achievability of the target rates defined 
above in a simplified scenario, draw the links to universal 
prediction, and introduce the techniques that will be used in 
the sequel. 

A. Problem description 

We focus on the problem of setting a prior $Q_i$ at time
$i$. We assume that at each time instance $i$, the system has
full knowledge of the sequence of past channels $W_1^{i-1}$. The
prior prediction mechanism sets $Q_i$ based on the knowledge
of $W_1^{i-1}$. Then, we assume that $I(Q_i, W_i)$ bits are conveyed
during time instance $i$. A predictor $Q_i(W_1^{i-1})$ attains a
given target rate $R(W_1^n)$ if for all sequences $W_1^n$ we have
$\frac{1}{n} \sum_{i=1}^{n} I(Q_i, W_i) \geq R(W_1^n) - \delta_n$, and $\delta_n \to 0$.

Fig. 1. A rate adaptive system with feedback

This abstract problem can apply to a situation where the 
channel sequence is constant during long blocks, and changes 
its value only from block to block, or from one transmission 
to another. In this case $i$ denotes the block index, and denoting
by $m$ the constant block length, at most $m \cdot I(Q_i, W_i)$ bits can
be sent in block $i$. If the channel is constant over long blocks
it is reasonable to assume that past channels can be estimated.
Note that in addition we made the assumption that $I(Q_i, W_i)$
is achievable, although this communication rate is unknown to 
the transmitter in advance, i.e. we ignored the problem of rate 
adaptation. Therefore the synthetic problem is a subset of the 
original problem and upper bounds that we show here apply 
also to the original problem. 

B. Classification as a universal prediction problem 

We begin by discussing the achievability of $C_2$ for the
synthetic problem. The target rate $C_2$ is special in being an
additive function for each value of $Q$. Universally attaining
$C_2$ under the conditions specified above falls into a widely
studied category of universal prediction problems [12], [13],
[14], [15]. Below, we present this class of problems and review
some results that will be important for our discussion.

These prediction problems have the following form: let
$b \in \mathcal{B}$ be a strategy in a set of possible strategies $\mathcal{B}$, and $x \in \mathcal{X}$
be a state of nature. A loss function $l(b, x)$ associates a loss
with each combination of a strategy and a state of nature. The
total loss over $n$ occurrences is defined as $L = \sum_{i=1}^{n} l(b_i, x_i)$.
The universal predictor $b_i(x_1^{i-1})$ assigns the next strategy
given the past values of the sequence, and before seeing the
current value. There is a set of reference strategies $\{b_i^{(k)}\}$
(sometimes called experts), which are visible to the universal
predictor. The target of universal prediction is to provide a
predictor $b_i$ which is asymptotically and universally better than
any of the reference strategies, in the sense defined below.

For a given sequence $x^n$, denote the losses of the universal
predictor and the reference strategies as $L = \sum_{i=1}^{n} l(b_i, x_i)$
and $L_k = \sum_{i=1}^{n} l(b_i^{(k)}, x_i)$, respectively. Denote the regret
of the universal predictor with respect to a specific reference
strategy as the excessive loss:

$$\mathcal{R}^{(k)} \triangleq L - L_k. \qquad (8)$$

$\mathcal{R}^{(k)}$ is a function of the sequence $x^n$ and the predictor. The
target of the universal predictor is to minimize the worst case
regret, i.e. attain

$$\mathcal{R}_{\mathrm{minimax}} = \min_{\{b_i\}} \max_{x^n} \max_{k} \mathcal{R}^{(k)}. \qquad (9)$$

The reference strategies may be defined in several different
ways. In the simplest form of the problem the competition is
against the set of fixed strategies $b_i^{(k)} = b^{(k)}$. The exact minimax
solution is known only for very specific loss functions
[13, §8], and a solution guaranteeing a vanishing normalized regret,
$\frac{1}{n} \max_{x^n, k} \mathcal{R}^{(k)} \to 0$, is
not known for general loss functions. However, there are many
prediction schemes which perform well for a wide range of
loss functions (see references above).

In the information theoretic framework, the log-loss
$l(b, x) = \log \frac{1}{b(x)}$, where $b(x)$ is a probability distribution
over $\mathcal{X}$, is the most familiar loss function, and is used in analyzing
universal source encoding schemes [12], since $l(b, x)$
represents the optimal encoding length of the symbol $x$ when
assigned a probability $b(x)$. It exhibits an asymptotical minimax
regret of $\frac{1}{n} \mathcal{R}_{\mathrm{minimax}} = O\left(\frac{\log n}{n}\right)$. However, in the more
general setting the asymptotical minimax regret decreases at a
slower rate of $\frac{1}{n} \mathcal{R}_{\mathrm{minimax}} = O\left(\sqrt{\frac{1}{n}}\right)$. There are several loss
functions which are characterized by a "smoother" behavior
for which better minimax regret is obtained [13, Theorem 3.1,
Proposition 3.1]. For some of these loss functions, a simple
forecasting algorithm termed "Follow the leader" (FL) can be
used [13, §3.2], [16, Theorem 1]. In FL, the universal forecaster
picks at every iteration $i$ the strategy that performed best in
the past, i.e. minimizes the cumulative loss over the instances
from 1 to $i - 1$.

The archetype of loss functions for which it is not possible
to obtain a better convergence rate than $O\left(\sqrt{\frac{1}{n}}\right)$ is the
absolute loss $l(b, x) = |b - x|$, where $x \in \mathcal{X} = \{0, 1\}$
and $b \in \mathcal{B} = [0, 1]$. The proof for the lower bound on the
minimax regret [13, Theorem 3.7] is based on generating the
sequence $x^n$ randomly, and calculating the minimum expected
regret (over $x^n$). This value is a lower bound for the minimum-maximum
regret (9). To show that the regret is $\Omega(\sqrt{n})$ it is
enough to consider only two competitors, one forecasting a
constant zero and one a constant one, and observe that since
the cumulative losses of the two competitors always sum up
to $n$, the minimum loss of the two competitors is a random
variable with a standard deviation of $O(\sqrt{n})$ which is upper
bounded by $\frac{n}{2}$, and therefore its expected value is $\frac{n}{2} - O(\sqrt{n})$,
whereas the expected loss of any sequential predictor over
the random sequence cannot be better than $\frac{n}{2}$. We will use a
similar idea to prove lower bounds on the regret in the current
problems. For general loss functions, and specifically for the
absolute loss, the simple FL strategy does not converge.
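The failure of FL under the absolute loss can be illustrated with a short sketch. The alternating sequence and the tie-breaking rule (ties resolved towards 0) are assumptions made for the illustration, and the function names are hypothetical.

```python
def follow_the_leader(past):
    """Minimizer of the cumulative absolute loss over the past bits.

    The loss sum(|b - x|) is linear in b on [0, 1], so it is minimized at
    the majority bit; ties are broken towards 0 (an assumption).
    """
    ones = sum(past)
    zeros = len(past) - ones
    return 1.0 if ones > zeros else 0.0

def run_fl(sequence):
    """Total absolute loss of FL, which predicts bit i from bits 0..i-1."""
    loss = 0.0
    for i, x in enumerate(sequence):
        b = follow_the_leader(sequence[:i])
        loss += abs(b - x)
    return loss

def best_fixed_loss(sequence):
    """Loss of the best constant forecast b in [0, 1], known in hindsight."""
    ones = sum(sequence)
    return float(min(ones, len(sequence) - ones))

n = 100
sequence = [1, 0] * (n // 2)   # adversarial alternating sequence
fl_loss = run_fl(sequence)
regret = fl_loss - best_fixed_loss(sequence)
print(fl_loss, regret)
```

On this sequence FL errs at every step and accumulates a loss of $n$, while the best constant forecast loses only $n/2$, so the regret grows linearly in $n$ rather than as $\sqrt{n}$.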

The problem of asymptotically attaining $C_2(W_1^n)$ is analogous
to the standard prediction problem, where the prior
$Q_i$ represents a strategy, and the channel $W_i$ represents a
state of nature. Our problem is given in terms of gains rather
than losses, so we may consider the loss to be $l(Q, W) = -I(Q, W)$.
The regret is therefore:

$$\mathcal{R}_n(Q) = \sum_{i=1}^{n} I(Q, W_i) - \sum_{i=1}^{n} I(Q_i, W_i). \qquad (10)$$

Note that the regret is defined in terms of bits rather than rates
(i.e. it is not normalized), for technical reasons.

C. A lower bound on the regret 

A natural question to ask is, then: what is the asymptotical 
form of the minimax regret expected in our case? As we 
will show, the prior prediction problem we posed, includes 
as a special case the prediction problem with the absolute 
loss function. Therefore, the asymptotical behavior cannot be
better than $O(\sqrt{n})$, and it is not possible to apply the simple
FL strategy.

The following example shows why the problem of attaining 
C2 includes as a particular case the absolute loss function: 

Example 1. Consider the quaternary to binary channel ($|\mathcal{X}| = 4$,
$|\mathcal{Y}| = 2$), which may be in one of two states $s \in \{0, 1\}$,
which define two conditional probability functions (shown as
$|\mathcal{Y}| \times |\mathcal{X}|$ matrices below):

$$W_0(Y|X) = \begin{bmatrix} 1 & 0 & \frac{1}{2} & \frac{1}{2} \\ 0 & 1 & \frac{1}{2} & \frac{1}{2} \end{bmatrix}, \qquad W_1(Y|X) = \begin{bmatrix} \frac{1}{2} & \frac{1}{2} & 1 & 0 \\ \frac{1}{2} & \frac{1}{2} & 0 & 1 \end{bmatrix}. \qquad (11)$$

By writing the input as two binary digits $X = [X_1, X_2]$,
the channel can be defined as follows: if $X_2 = s$ then $Y = X_1$,
otherwise $Y = \mathrm{Ber}(\frac{1}{2})$. These channels are depicted
in Figure 2, where transitions are denoted by solid lines for
probability 1, and dashed lines for probability $\frac{1}{2}$. We consider
the same prediction problem, under the simplifying assumption
that the channel $W_i = W_{s_i}$ is chosen only between the two
channels above, and the forecaster knows this limitation, i.e.
only the sequence of states $s_i \in \{0, 1\}$ is unknown.

It is clear from the concavity of the mutual information with
respect to the prior, and the symmetry with respect to $X_1$
(interchanging the values of $X_1$
leads to the same mutual information), that any solution can
only be improved by taking a uniform distribution over $X_1$.
Therefore, without loss of generality, the input distribution $Q$
can be defined by a single value $q = \Pr(X_2 = 1) \in [0, 1]$, and
be written $Q = [\frac{1}{2}(1 - q), \frac{1}{2}(1 - q), \frac{1}{2} q, \frac{1}{2} q]$. For this choice
the output will always be uniformly distributed, $\mathrm{Ber}(\frac{1}{2})$. We
have:

$$I(Q, W_0) = H(Y) - H(Y|X) = 1 - \sum_{x} Q(x) H(Y|X = x) = 1 - q, \qquad (12)$$

and similarly $I(Q, W_1) = q$, therefore we can write:

$$I(Q, W_s) = 1 - |s - q|. \qquad (13)$$

Hence, even under this limited scenario, the loss function
$1 - I(Q, W)$ behaves like the absolute loss function, and
therefore the normalized minimax regret (and the redundancy
in attaining $C_2$) is at least $O\left(\sqrt{\frac{1}{n}}\right)$.
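The identity $I(Q, W_s) = 1 - |s - q|$ can be verified numerically. The sketch below assumes the input ordering $[X_1, X_2] = 00, 10, 01, 11$ for the matrix columns (so that the first two inputs have $X_2 = 0$, matching the form of $Q$ above); the function names are illustrative.

```python
import math

# Channel matrices W[y][x] for the two states, columns ordered as
# (X1, X2) = (0,0), (1,0), (0,1), (1,1); the ordering is an assumption.
W0 = [[1.0, 0.0, 0.5, 0.5],
      [0.0, 1.0, 0.5, 0.5]]
W1 = [[0.5, 0.5, 1.0, 0.0],
      [0.5, 0.5, 0.0, 1.0]]

def mutual_information(Q, W):
    """I(Q, W) in bits, with W[y][x] = W(y|x) and Q[x] the prior."""
    ny, nx = len(W), len(Q)
    Py = [sum(Q[x] * W[y][x] for x in range(nx)) for y in range(ny)]
    total = 0.0
    for x in range(nx):
        for y in range(ny):
            p = Q[x] * W[y][x]
            if p > 0.0:
                total += p * math.log2(W[y][x] / Py[y])
    return total

def prior(q):
    """Prior uniform over X1 with Pr(X2 = 1) = q."""
    return [0.5 * (1 - q), 0.5 * (1 - q), 0.5 * q, 0.5 * q]

def max_deviation(steps=50):
    """Largest gap between I(Q, W_s) and 1 - |s - q| over a grid of q."""
    worst = 0.0
    for k in range(steps + 1):
        q = k / steps
        Q = prior(q)
        worst = max(worst,
                    abs(mutual_information(Q, W0) - (1 - q)),
                    abs(mutual_information(Q, W1) - q))
    return worst

print(max_deviation())
```

The deviation is at the level of floating-point rounding, which is consistent with the gain being exactly linear in $q$ for each state.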



Note that the relation to the absolute loss implies that the
simple FL predictor $Q_i = \arg\max_{Q} \sum_{t=1}^{i-1} I(Q, W_t)$ cannot be
applied to our problem. An example to illustrate this and some
further details are given in the appendix.

Since in the rest of the paper we will focus on the rate
function $C_3$, it is interesting to note that, although this rate
is smaller, in general, than $C_2$, the minimum redundancy in
obtaining it cannot be better than $O\left(\sqrt{\frac{1}{n}}\right)$. To show this, we
only need to show that in the context of the counter-example
shown above, $C_2 = C_3$. For a specific sequence of channels,
denote by $p$ the relative frequency with which channel $W_1$
appears. The averaged channel is $(1 - p) W_0 + p W_1$. It is easy
to see that the capacity of this channel is obtained by placing
the entire input probability on the two useful inputs of the
channel that appears most of the time. That is, if $p > \frac{1}{2}$ we
place the input probability on the useful inputs of $W_1$ and
obtain the rate $p \cdot C(W_1) = p$, and otherwise obtain
$(1 - p) \cdot C(W_0) = 1 - p$. Hence the capacity of the averaged channel
is $C_3 = \max(p, 1 - p)$. On the other hand,

$$C_2 = \max_{Q} \left( (1 - p) \cdot I(Q, W_0) + p \cdot I(Q, W_1) \right) = \max_{q} \left( (1 - p)(1 - q) + p q \right) = \max(p, 1 - p). \qquad (14)$$

Using the example above, we can also see why $C_1$ is not
universally achievable with an asymptotically vanishing normalized
regret by a sequential predictor. In the example, the
capacities of the two channels are $C(W_s) = 1$. Suppose
the sequence of channel states $s^n \in \{0, 1\}^n$ is generated
randomly i.i.d. $\mathrm{Ber}(\frac{1}{2})$. Then for any sequential predictor of
$q$, the expected rate in each time instance is
$E[I(Q, W_s)] = \frac{1}{2}(1 - q) + \frac{1}{2} q = \frac{1}{2}$, while the target rate is $C_1 = 1$. Therefore
the expected normalized regret with respect to $C_1$ is $\frac{1}{2}$, and
the maximum regret (maximum over the sequence $\{W_i\}$) is
lower bounded by the expected regret.

To summarize, we have seen why $C_1$ is not universally
achievable, and therefore $C_2$ constitutes a reasonable target.
Furthermore, the minimax regret with respect to $C_2$ is at least
$O\left(\sqrt{\frac{1}{n}}\right)$, and the simple FL predictor following the best
a-posteriori strategy does not yield a vanishing regret.
D. A prediction algorithm

The prediction algorithm proposed below is based on a
well known technique of a weighted average predictor, using
exponential weighting [13, §2.1]. A minor difference with
respect to known results is the extension to a continuous set
of reference strategies.

A weight function $w(Q)$ is any non-negative function
$w : \Delta_{\mathcal{X}} \to \mathbb{R}^+$ with $\int_{\Delta_{\mathcal{X}}} w(Q) dQ = 1$. All integrals in the sequel
are by default over $\Delta_{\mathcal{X}}$.

Define the following weight function:

$$w_i(Q) = \frac{e^{\eta \sum_{t=1}^{i-1} I(Q, W_t)}}{\int e^{\eta \sum_{t=1}^{i-1} I(Q', W_t)} dQ'}, \qquad (15)$$

and the predictor:

$$Q_i = \int_{\Delta_{\mathcal{X}}} Q \cdot w_i(Q) \cdot dQ. \qquad (16)$$

The weighting function gives a higher weight to priors that
succeeded in the past and the predictor averages the potential
priors with respect to the weight. This is illustrated in Fig. 3.
The following theorem, which is proven in the next section,
gives a bound on the regret of this predictor.

Theorem 1. Let $I(Q, W)$, $Q \in \Delta_{\mathcal{X}}$, be a bounded function
$0 \leq I(Q, W) \leq I_{\max}$ which is concave in its first argument. Then
for $n$ large enough so that $\ldots < e^{-1}$, the predictor defined
by (15) and (16) with

$$\eta = \frac{1}{I_{\max}} \sqrt{\frac{(|\mathcal{X}| - 1) \ln n}{n}}$$

yields

$$\frac{1}{n} \sum_{i=1}^{n} I(Q_i, W_i) \geq \max_{Q} \frac{1}{n} \sum_{i=1}^{n} I(Q, W_i) - \delta, \qquad (17)$$

with

$$\delta = 2 I_{\max} \sqrt{\frac{(|\mathcal{X}| - 1) \ln n}{n}}. \qquad (18)$$

Note that the theorem applies to gain functions more general
than the mutual information, since it uses only the properties
of concavity and boundedness. In the case of mutual information
we have

$$I_{\max} = \log \min(|\mathcal{X}|, |\mathcal{Y}|). \qquad (19)$$

We obtained a convergence rate of $O\left(\sqrt{\frac{\ln n}{n}}\right)$, which is
slightly worse than the asymptotic bound of $O\left(\sqrt{\frac{1}{n}}\right)$ from
Section III-C. The additional $\sqrt{\ln n}$ may be attributed to the
fact that the space of reference predictors is continuous (it results
from Lemma 2 stated below), but we do not know if this is
the best convergence rate.

Fig. 3. An illustration of exponential weighting. The triangle represents
the unit simplex. The two peaks represent two priors $Q$ which have a
relatively large gain $\sum_{t=1}^{i-1} I(Q, W_t)$. The weight function $w_i(Q)$ combines
them exponentially, and the predictor $Q_i$ (represented as a black spot) is the
weighted average.

E. Proof of Theorem 1

In this section we analyze the performance of the predictor
(16) and prove Theorem 1. Define the instantaneous regret
$r_i(Q)$ and the cumulative regret $\mathcal{R}_i(Q)$ as functions of $Q$:

$$r_i(Q) \triangleq I(Q, W_i) - I(Q_i, W_i), \qquad (20)$$

$$\mathcal{R}_i(Q) \triangleq \sum_{t=1}^{i} r_t(Q) = \sum_{t=1}^{i} I(Q, W_t) - \sum_{t=1}^{i} I(Q_t, W_t). \qquad (21)$$

These functions express the regret with respect to a fixed
competing prior $Q$. The claim of the theorem is equivalent to
the claim that for all $Q$, $\mathcal{R}_n(Q) \leq n \delta$. We sometimes omit
the dependence on $Q$ for brevity.
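The exponentially weighted predictor can be sketched on the two-state example of Section III-C, where the prior reduces to the scalar $q = \Pr(X_2 = 1)$ and the gain is $1 - |s - q|$. This is an illustration under stated assumptions, not the system analyzed in the paper: the integral over the simplex is replaced by a uniform grid over $q \in [0, 1]$, the dimension is taken as $d = 1$ for this reduced parametrization, and the names `gain` and `exp_weight_run` are hypothetical.

```python
import math

def gain(q, s):
    """Reduced gain of the two-state example: I(Q, W_s) = 1 - |s - q|."""
    return 1.0 - abs(s - q)

def exp_weight_run(states, grid=200, i_max=1.0, d=1):
    """Run the weighted-average predictor; return (achieved, best fixed) gain."""
    n = len(states)
    eta = math.sqrt(d * math.log(n) / n) / i_max   # eta as in Theorem 1
    qs = [k / grid for k in range(grid + 1)]       # grid stands in for the simplex
    cum_gain = [0.0] * len(qs)                     # past gain of each fixed q
    achieved = 0.0
    for s in states:
        top = max(cum_gain)                        # shift for numerical stability
        weights = [math.exp(eta * (g - top)) for g in cum_gain]
        norm = sum(weights)
        # Predictor: weighted average of the candidate priors.
        q_i = sum(q * w for q, w in zip(qs, weights)) / norm
        achieved += gain(q_i, s)
        cum_gain = [g + gain(q, s) for g, q in zip(cum_gain, qs)]
    return achieved, max(cum_gain)

n = 400
achieved, best_fixed = exp_weight_run([1, 0] * (n // 2))
regret = best_fixed - achieved
bound = 2.0 * math.sqrt(n * math.log(n))           # n * delta with I_max = d = 1
print(regret, bound)
```

Unlike FL, the weighted average moves only slightly after each observation, so on the alternating sequence its regret stays far below the $n \delta$ bound instead of growing linearly.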

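For $\eta > 0$ of our choice, we define the following potential
function:

$$\Phi(u) \triangleq \int e^{\eta u(Q)} dQ, \qquad (22)$$

where $u : \Delta_{\mathcal{X}} \to \mathbb{R}$ is an arbitrary function defined over the
unit simplex. Note that for large values of $\eta \cdot u$, $\Phi(u)$ approximates
$\max_Q(u)$, in the sense that $\frac{1}{\eta} \ln \Phi(u)$ approaches the maximum.
As customary in this prediction technique,
the proof consists of two parts:

1) Bounding the growth rate of $\Phi(\mathcal{R}_i(Q))$ over $i = 1, 2, \ldots, n$ for any $Q$.

2) Relating $\max_Q \{\mathcal{R}_n(Q)\}$ to $\Phi(\mathcal{R}_n(Q))$.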

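The techniques we use are based on Cesa-Bianchi and Lugosi's
[13] (see Theorem 2.1, Corollary 2.2, Theorem 3.3).

From the concavity of $I(Q, W)$ with respect to $Q$ we have
that for any weight function $w(Q)$ and any $W_i$:

$$\int w(Q) r_i(Q) dQ = \int w(Q) I(Q, W_i) dQ - I(Q_i, W_i) \leq I\left( \int w(Q) Q \, dQ, W_i \right) - I(Q_i, W_i). \qquad (23)$$

For the weight function $w = w_i$ of (15), the predictor (16) gives
$\int w_i(Q) Q \, dQ = Q_i$, so the right-hand side is zero.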

Following [13] we term this inequality the "Blackwell condition".
The meaning of this condition is that by choice of $w(Q)$
we can prevent an increase of $\mathcal{R}_i(Q)$ in a chosen direction
($w(Q)$ can be thought of as a unit vector in the Hilbert space
of functions over $\Delta_{\mathcal{X}}$). For the specific choice of the weight
function (15), this direction is proportional to the gradient of
$\Phi(\mathcal{R})$ with respect to $\mathcal{R}$, thus preventing any growth in this
direction and leaving only second order terms that contribute
to the increase of $\Phi(\mathcal{R}_i(Q))$. Since the term $\sum_{t=1}^{i-1} I(Q_t, W_t)$
in (21) does not depend on $Q$, the weight function (15) can
be alternatively written as:

$$w_i(Q) = \frac{e^{\eta \mathcal{R}_{i-1}(Q)}}{\int e^{\eta \mathcal{R}_{i-1}(Q')} dQ'}. \qquad (24)$$
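$w_i(Q)$ is indifferent to any constant addition to $\mathcal{R}_{i-1}(Q)$
due to the normalization. The growth of the potential can be
bounded as follows:

$$\Phi(\mathcal{R}_i) = \int e^{\eta \mathcal{R}_{i-1}(Q)} e^{\eta r_i(Q)} dQ = \Phi(\mathcal{R}_{i-1}) \cdot \int w_i(Q) e^{\eta r_i(Q)} dQ. \qquad (25)$$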



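Notice that $|r_i| \leq I_{\max}$. We take $\eta$ small enough that
$\eta |r_i| \leq \eta I_{\max} < 1$ and use the following inequality (proven in
Appendix E):

Lemma 1. For $x \in [-1, 1]$:

$$1 + x \leq e^x \leq 1 + x + x^2.$$

Returning to (25) we have:

$$\int w_i(Q) e^{\eta r_i} dQ \leq \int w_i(Q) \left( 1 + \eta r_i + \eta^2 r_i^2 \right) dQ = \underbrace{\int w_i(Q) dQ}_{= 1} + \eta \underbrace{\int w_i(Q) r_i dQ}_{\leq 0, \text{ by } (23)} + \eta^2 \underbrace{\int w_i(Q) r_i^2 dQ}_{\leq I_{\max}^2} \qquad (26)$$

$$\leq 1 + \eta^2 I_{\max}^2 \leq e^{\eta^2 I_{\max}^2}. \qquad (27)$$

Therefore, recursively applying (25):

$$\Phi(\mathcal{R}_n) \leq e^{\eta^2 I_{\max}^2} \cdot \Phi(\mathcal{R}_{n-1}) \leq \ldots \leq e^{n \eta^2 I_{\max}^2} \cdot \Phi(0). \qquad (28)$$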

Notice that $\Phi(0) = \int 1\, dQ = \mathrm{vol}(\Delta_{\mathcal{X}})$. This completes the
first part of showing that the increase in $\Phi(\mathcal{R}_n)$ is bounded.
For the second part we shall use the following lemma which
relates the exponential weighting of a function to its maximum,
and is proven in Appendix A.

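Lemma 2. Let $F(x)$ be a real non-negative bounded function
$F : S \to [a, b]$ concave in $S$, where $S$ is a closed convex
vector region of dimension $d$, and let $\eta$ satisfy $\eta(b - a) \ge d$,
then

$$\max_{x \in S} F(x) \le \frac{1}{\eta} \ln\left[\frac{1}{\mathrm{vol}(S)} \int_S e^{\eta F(x)}\, dx \cdot \left(\frac{\eta e (b - a)}{d}\right)^{d}\right] = \frac{1}{\eta} \ln\left[\frac{\int_S e^{\eta F(x)}\, dx}{\mathrm{vol}(S)}\right] + \frac{d}{\eta} \ln\left(\frac{\eta e (b - a)}{d}\right). \tag{29}$$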



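Let $F(Q) = \mathcal{R}_n(Q)$. In this case the convex region is $\Delta_{\mathcal{X}}$
and therefore $d = \dim(\Delta_{\mathcal{X}}) = |\mathcal{X}| - 1$. By (21) we can bound
$F$ by:

$$-\sum_{i=1}^{n} I(Q_i, W_i) \le F(Q) \le n I_{\max} - \sum_{i=1}^{n} I(Q_i, W_i), \tag{30}$$

where the factor $\sum_{i=1}^{n} I(Q_i, W_i)$ is constant in $Q$. We have
$b - a = n I_{\max}$. Assuming $\eta n I_{\max} \ge d$ to satisfy the conditions
of Lemma 2 we obtain from (29):

$$\max_Q \mathcal{R}_n(Q) \le \frac{1}{\eta} \ln\frac{\Phi(\mathcal{R}_n)}{\Phi(0)} + \frac{d}{\eta} \ln\left(\frac{\eta e n I_{\max}}{d}\right) \le n \eta I_{\max}^2 + \frac{d}{\eta} \ln(n) \triangleq \Delta, \tag{31}$$

where in the last inequality we assumed $\frac{\eta e I_{\max}}{d} \le 1$ (this
would hold for $\eta$ small enough). We use the following lemma
to optimize the RHS of (31) with respect to $\eta$: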

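Lemma 3. The unique minimum over $t \in \mathbb{R}^+$ of
$f(t) = a \cdot t^{\alpha} + b \cdot t^{-\beta}$ ($a, b, \alpha, \beta > 0$) is obtained at
$t^* = \left(\frac{\beta b}{\alpha a}\right)^{\frac{1}{\alpha + \beta}}$ and equals

$$f(t^*) = \left(1 + \frac{\alpha}{\beta}\right) a \cdot (t^*)^{\alpha}. \tag{32}$$

Particularly, for $\alpha = \beta = 1$, i.e. $f(t) = a \cdot t + \frac{b}{t}$, we have
$t^* = \sqrt{b/a}$ and $f(t^*) = 2\sqrt{ab}$.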
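The statement of Lemma 3 can be checked numerically. The following is a minimal sketch, not part of the original proof; the function name `lemma3_minimizer` is hypothetical and chosen only for illustration.

```python
def lemma3_minimizer(a, b, alpha=1.0, beta=1.0):
    """Return (t_star, f(t_star)) for f(t) = a * t**alpha + b * t**(-beta).

    Assumes a, b, alpha, beta > 0, as in Lemma 3.
    """
    # Stationary point of f: alpha * a * t**(alpha + beta) = beta * b.
    t_star = (beta * b / (alpha * a)) ** (1.0 / (alpha + beta))
    value = a * t_star ** alpha + b * t_star ** (-beta)
    return t_star, value
```

For $\alpha = \beta = 1$ the returned value agrees with $2\sqrt{ab}$, which is the form used to optimize $\eta$ in (31).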

The proof of the lemma is simple by a direct derivation (see
Appendix E). Applying the lemma to the optimization of $\eta$ in
(31) we obtain:

$$\eta^* = \sqrt{\frac{d \ln(n)}{n I_{\max}^2}} \tag{33}$$

and

$$\Delta^* = \Delta\big|_{\eta = \eta^*} = 2 I_{\max} \sqrt{d\, n \ln(n)}. \tag{34}$$



We now verify the assumptions we made along the way.
In (27) we assumed that $\eta I_{\max} \le 1$. If the contrary holds,
$\eta I_{\max} > 1$, then considering the first term in the RHS of (31),
we have $\Delta > n I_{\max}$, and therefore the theorem holds in a
void way. To apply Lemma 2 we required $\eta n I_{\max} \ge d$. If the
opposite is true, i.e. $\eta n I_{\max} < d$, then the second term in the RHS
of (31) becomes $\frac{d}{\eta} \ln(n) > n I_{\max} \ln(n)$, and so for $n \ge e$ we
would have again $\Delta > n I_{\max}$ and the theorem will hold in
a void way. Thus for the two last conditions, it is enough
that $n \ge 3$, since in this case if either of the conditions does
not hold, the theorem becomes true automatically (in a void
way). Lastly, in (31) we assumed $\frac{\eta e I_{\max}}{d} \le 1$. Substituting
$\eta^*$ we have

$$\frac{\eta^* e I_{\max}}{d} = e \sqrt{\frac{\ln(n)}{d\, n}} \le e \sqrt{\frac{\ln(n)}{n}},$$

which becomes smaller than 1 for $n$ large enough. The last condition
supersedes $n \ge e$, and is specified as a requirement in the
theorem. □
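The predictor analyzed above can be illustrated with a small simulation. The sketch below is an assumption-laden approximation, not the scheme of the paper: it replaces the integral over the simplex by a finite grid of binary-input priors, measures mutual information in nats, and uses binary-output channels so that $I_{\max} = \ln 2$. The names `mutual_information`, `random_z_channels` and `run_predictor` are hypothetical.

```python
import math
import random


def mutual_information(q, W):
    """I(Q, W) in nats for a prior q[x] and a channel W[x][y]."""
    ny = len(W[0])
    py = [sum(q[x] * W[x][y] for x in range(len(q))) for y in range(ny)]
    total = 0.0
    for x, qx in enumerate(q):
        for y in range(ny):
            if qx > 0 and W[x][y] > 0:
                total += qx * W[x][y] * math.log(W[x][y] / py[y])
    return total


def random_z_channels(n, seed=0):
    """Hypothetical test sequence: random Z-channels and their mirror images."""
    rng = random.Random(seed)
    channels = []
    for _ in range(n):
        p = rng.random()
        if rng.random() < 0.5:
            channels.append([[1.0, 0.0], [p, 1.0 - p]])
        else:
            channels.append([[1.0 - p, p], [0.0, 1.0]])
    return channels


def run_predictor(channels, grid_size=101):
    """Exponentially weighted prior prediction on a grid; returns (regret, bound)."""
    n = len(channels)
    d = 1                      # dimension of the simplex for a binary input
    i_max = math.log(2)        # assumption: binary output, nats
    eta = math.sqrt(d * math.log(n) / (n * i_max ** 2))   # value from (33)
    grid = [[g / (grid_size - 1), 1.0 - g / (grid_size - 1)]
            for g in range(grid_size)]
    cum_gain = [0.0] * grid_size   # sum over past t of I(Q, W_t), per grid point
    earned = 0.0                   # sum over t of I(Q_t, W_t)
    for W in channels:
        top = max(cum_gain)        # subtracting a constant does not change the weights
        weights = [math.exp(eta * (c - top)) for c in cum_gain]
        z = sum(weights)
        q_t = [sum(w * q[x] for w, q in zip(weights, grid)) / z for x in range(2)]
        earned += mutual_information(q_t, W)
        for j, q in enumerate(grid):
            cum_gain[j] += mutual_information(q, W)
    regret = max(cum_gain) - earned
    bound = 2 * i_max * math.sqrt(d * n * math.log(n))    # value from (34)
    return regret, bound
```

Calling `run_predictor(random_z_channels(200))` returns the regret against the best fixed prior on the grid together with the value of (34). The grid competitor is slightly weaker than the continuous maximum, so this is only an illustration of the bound, not a verification of the theorem.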

IV. Arbitrary channel variation

In this section we return to the problem defined in Section II-B
and present the main results of the paper: the
achievability of the capacity of the averaged channel, and
a converse showing that this is the best rate, under some
conditions. We give the outline of the communication system
attaining this rate, while leaving out some of the technical
details, such as decoding and channel estimation (these will
be completed in the next section). We show that under abstract
assumptions, the system achieves the desired rate. The same
communication scheme and predictor will be used, with slight
modifications, to prove the main result in Section V.

A. Target rate 

The synthetic problem differs from the problem defined in
Section II-B in two main aspects:

1) It assumes that the sequence of past channels is fully
known. Since the receiver observes only one output
sample from each channel, this assumption is not realistic.
On the other hand, the time-averaged channel over
"large" chunks of symbols can be measured.

2) It assumes that a rate corresponding to a sum of the
per-symbol mutual information can be attained, whereas
with an arbitrarily varying channel, the amount of mutual
information between the input and output vectors is
potentially lower.

Therefore, as we shall see, $C_2$ is no longer achievable in
the context of the arbitrarily varying channel defined in
Section II-B. In Appendix H we show that even imposing on the
synthetic problem only the limitation that the past channels are
not given, but need to be estimated, leads to the conclusion
that $C_2$ is not attainable. Therefore we compromise on an
alternative target: obtaining $C_3 = C(\overline{W})$, i.e. the capacity of
the averaged channel. As we shall show in this section and
the next, this rate is indeed asymptotically achievable.

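The rate $C(\overline{W})$ is certainly not the maximum achievable
target rate. As an example, if $C(\overline{W})$ is achievable for large
$n$ then by operating the same scheme on two halves of
the transmission time one could attain
$R = \frac{1}{2} C\left(\overline{W}_{1}^{n/2}\right) + \frac{1}{2} C\left(\overline{W}_{n/2+1}^{n}\right)$,
where $\overline{W}_{1}^{n/2}$ and $\overline{W}_{n/2+1}^{n}$ denote the averaged
channels on the two halves. This rate is in general higher,
because due to the convexity of the mutual information
with respect to the channel

$$C(\overline{W}) = \max_Q I(Q, \overline{W}) \le \max_Q \left[\tfrac{1}{2} I\left(Q, \overline{W}_{1}^{n/2}\right) + \tfrac{1}{2} I\left(Q, \overline{W}_{n/2+1}^{n}\right)\right] \le R.$$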
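The two-halves argument can be made concrete with a toy calculation. The sketch below is illustrative only and rests on assumptions not made in the text: the two half-averaged channels are taken to be a Z-channel and its mirror image, mutual information is in nats, and capacity is approximated by a grid search over binary priors. The function names are hypothetical.

```python
import math


def mutual_information(q, W):
    """I(Q, W) in nats for a prior q[x] and a channel W[x][y]."""
    ny = len(W[0])
    py = [sum(q[x] * W[x][y] for x in range(len(q))) for y in range(ny)]
    total = 0.0
    for x, qx in enumerate(q):
        for y in range(ny):
            if qx > 0 and W[x][y] > 0:
                total += qx * W[x][y] * math.log(W[x][y] / py[y])
    return total


def capacity_binary_input(W, steps=1000):
    """Grid-search approximation of max_Q I(Q, W) for a binary input."""
    return max(mutual_information([g / steps, 1.0 - g / steps], W)
               for g in range(steps + 1))


def average_channel(channels):
    """Entry-wise average of a list of channels with equal dimensions."""
    k = len(channels)
    return [[sum(W[x][y] for W in channels) / k
             for y in range(len(channels[0][0]))]
            for x in range(len(channels[0]))]


first_half = [[1.0, 0.0], [0.5, 0.5]]    # assumed averaged channel, first half
second_half = [[0.5, 0.5], [0.0, 1.0]]   # assumed averaged channel, second half
c_avg = capacity_binary_input(average_channel([first_half, second_half]))
r_halves = 0.5 * capacity_binary_input(first_half) \
    + 0.5 * capacity_binary_input(second_half)
```

In this example the overall average is a binary symmetric channel with crossover 0.25, whose capacity `c_avg` is smaller than `r_halves`, in line with the inequality above.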

On the other hand, $C(\overline{W})$ is the maximum achievable rate
which is independent of the order of the sequence $\{W_i\}$, or, in
other words, which is fixed under permutation of the sequence.
This observation is formalized in the following theorem:

Theorem 2. Let $R(W_1^n)$ (for $n = 1, 2, \ldots$) be a sequence
of rate functions, which are oblivious to the order of $W_1^n$.
If the sequence is asymptotically attainable according to
Definition 7 then there exists a sequence $\delta_n \underset{n \to \infty}{\longrightarrow} 0$ such
that $R(W_1^n) \le C(\overline{W}) + \delta_n$.

Note that $C(\overline{W})$ depends on $n$ through the average over $n$
channels $\{W_i\}_{i=1}^{n}$. Since both $C_1$ and $C_2$ are oblivious to the
order of $W_1^n$, Theorem 2 implies they are not achievable.

Following is a rough outline of the proof. Consider the channel
generated by uniformly drawing a random permutation
$\pi$ of the indices $i = 1, \ldots, n$, and using the channels $W_i$ in the
permuted order $W_{\pi(i)}$. If a system guarantees a rate which
is fixed under permutation, then this rate would be fixed for
all drawings of $\pi$, and therefore for the channel we described,
the system can guarantee the rate $R(W_1^n)$ a-priori. Hence, the
capacity of this channel must be at least $R(W_1^n)$. The next
stage is to show that the feedback capacity of this channel
is at most $C(\overline{W})$. Due to the fact we select the channels
from the set $\{W_i\}_{i=1}^{n}$ without replacement, the proof is a
little technical and will be deferred to Appendix F. However, to
give an intuitive argument, if we replace the channel described
above by a similar channel, obtained by randomly drawing at
each time instance one of $\{W_i\}_{i=1}^{n}$, this time with replacement,
then this new channel is simply the DMC with channel law
$\overline{W}$. Therefore feedback does not increase the capacity and its
feedback capacity is simply $C(\overline{W})$. The main point in the
proof is to show there is no difference in feedback-capacity
between the two channels, and the main tool is Hoeffding's
bounds on sampling without replacement [17].

Another interesting property of the rate $C(\overline{W})$ is that it
meets or exceeds the random-code capacity of any memoryless
AVC defined over the same alphabet, and thus attaining
$C(\overline{W})$ yields universality over all AVC's (see Section VI-A).
Through the relation to AVC capacity we can see that common
randomness is essential to obtain $C(\overline{W})$, as it is essential for
obtaining the random-code capacity [1].

After settling for C{W), the next question that naturally 
arises is: what is the best convergence rate of the regret, with 
respect to this target? In Section III-C we have shown that
even in the context of the synthetic problem of Section III 
(with full knowledge of past channels), the regret with respect 

to C3 is at least 0{n^ 2 ), and this lower bound naturally holds 
in the current problem, where only partial knowledge of past 
channels is available. 

The following theorem formalizes the claim that $C(\overline{W})$ is
achievable according to Definition 1.

Theorem 3. For every $\epsilon, \delta > 0$ there exists $N$ and a constant
$c_\Delta$, such that for any $n \ge N$ there is an adaptive rate system
with feedback and common randomness, where for the problem
of Section II-B over any sequence of channels $\{W_i(y|x)\}_{i=1}^{n}$:

1) The probability of error is at most $\epsilon$

2) The rate satisfies $R \ge C(\overline{W}) - \Delta_C$ with probability at
least $1 - \delta$



3) Ac 



CA ■ 



Corollary 1. Specific values for e, 6, Ac can be obtained as 
follows. Let dejSo^cx > be parameters of choice. Then the 
constants n^^i^ and ca are given in the proof, by ( |114| l, dl 17 
where constants used in these equations are defined in 



19 



(|42]), (fT05]l-(fT07ll, ( fT09] l. For any n > Unun, e = n""' 

and S ^ e + Sq. 

Corollary 2. The same holds if $W_i$ is determined (e.g. by
an adversary) as a function of the message and all previous
channel inputs and outputs $X^{i-1}, Y^{i-1}$.

A numerical example is given after the proof (Example 2).
The proof of the theorem is given in Section V.

Fig. 4. An illustration of the combination of a rateless scheme with
prior prediction. Each box represents a rateless block in which $K$ bits are
transmitted.



B. The communication scheme 

In this section we give the communication scheme, up to some
details which will be completed later on (Section V-B). One
of the issues that we ignored in the synthetic problem is the
determination of the rate $R$ before knowing the channel. To
solve this problem we use rateless codes [18]. We divide the
available time into multiple such blocks as done by Eswaran et 
al and in g). 

We fix a number $K$ of bits per block. In each block, $K$ bits
from the message string are sent. At each block $i = 1, 2, \ldots$,
a codebook of $\exp(K)$ codewords is generated randomly and
i.i.d. (in time and message index) according to the prior $Q_i(x)$.
$Q_i(x)$ is determined by a prediction scheme which is specified
below. The random drawing of the codewords is carried out by
using the common randomness, and the codebook is known to
both sides. The relevant codeword matching the message
sub-string is sent to the receiver symbol by symbol. At each symbol
of the block and for each codeword $\mathbf{x}_l$, $l = 1, \ldots, \exp(K)$
in the codebook, the receiver evaluates a decoding condition
(59) that will be specified later on. Roughly speaking, the
condition measures whether there is enough information from
the channel output to reliably decode the message.

The receiver decides to terminate the block if the condition
(59) holds, and informs the transmitter. When this happens,
the receiver determines the decoded codeword as one of the
codewords that satisfied (59). Then, using the known channel
output $\mathbf{y}$, and the decoded input $\mathbf{x}$ over the block which was
decoded, the receiver computes an estimate of the averaged
channel over the block. The specific estimation scheme will
be specified in Section V-B.

The receiver calculates a new prior for the next block
according to the prediction scheme that will be specified
below. The receiver sends the new prior to the transmitter.
Alternatively, the receiver may send the estimated channel, and
the new prior can be calculated at each side separately. The
new block $i + 1$ starts at the next symbol, and the process
continues, until symbol $n$ is reached. The last block may
terminate before decoding.

C. The prediction algorithm 

In this section we present the prediction algorithm. We denote
by $i$ the index of the block, and by $\overline{W}_i$ the averaged channel
over the block, i.e. if the block $i$ starts at symbol $k_i$ and
ends at $k_{i+1} - 1$, then $\overline{W}_i(y|x) \triangleq \frac{1}{m_i} \sum_{t=k_i}^{k_{i+1}-1} W_t(y|x)$.
The length of the $i$-th block is denoted $m_i = k_{i+1} - k_i$. We
use an exponentially weighted predictor mixed with a uniform
prior. The motivation for using the uniform prior is explained
in the next section. Let $U = \frac{1}{|\mathcal{X}|}$ be the uniform prior over
$\mathcal{X}$. We define the predictor as:



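$$Q_i = (1 - \lambda) \int_{\Delta_{\mathcal{X}}} w_i(Q)\, Q\, dQ + \lambda U, \tag{35}$$

where

$$w_i(Q) = \frac{1}{\Phi\left(\sum_{j=1}^{i-1} m_j F_j\right)} \exp\left(\eta \sum_{j=1}^{i-1} m_j F_j(Q)\right), \tag{36}$$

where $F_i(Q)$ is an estimate of the mutual information of the
averaged channel over block $i$, $I(Q, \overline{W}_i)$, and is interpreted
as an estimate of the number of bits that would have been
sent with the alternative prior $Q$. This estimate is defined later
on in Section V-E. The parameters $\lambda$, $\eta$ and $K$ will be chosen
later on. $\Phi$ is the potential function defined in (22). The term
$\frac{1}{\Phi(\cdot)}$ normalizes $w_i(Q)$ to $\int_{\Delta_{\mathcal{X}}} w_i(Q)\, dQ = 1$.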
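The mixing step of the predictor can be sketched in a few lines. This is a minimal illustration under stated assumptions: the integral over the simplex is replaced by a finite grid of candidate priors, and `cum_gain[j]` is assumed to hold the accumulated quantity that appears in the exponent of the weight function for grid point `j`. The name `mixed_prior` is hypothetical.

```python
import math


def mixed_prior(cum_gain, grid, eta, lam):
    """Grid version of Q_i = (1 - lam) * (weighted mean of priors) + lam * U.

    cum_gain[j] is the accumulated gain of grid[j] (an assumption of this
    sketch); grid[j] is a probability vector over the input alphabet.
    """
    top = max(cum_gain)   # a common constant cancels in the normalization
    weights = [math.exp(eta * (g - top)) for g in cum_gain]
    z = sum(weights)
    k = len(grid[0])
    mean = [sum(w * q[x] for w, q in zip(weights, grid)) / z for x in range(k)]
    # Mixing with the uniform prior keeps every input probability >= lam / k.
    return [(1.0 - lam) * m + lam / k for m in mean]
```

Because of the uniform component, every input symbol keeps probability at least $\lambda / |\mathcal{X}|$, which is the property the text relies on when the channel is later estimated.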

The following Lemma formalizes the claim that the pre- 
dictor resulting of (|35|)-(|36l), asymptotically achieves a rate 

Lemma 4. Let $F_i(Q)$, $i = 1, \ldots, B+1$ be a set of $B+1$ non-negative
concave functions of the prior $Q \in \Delta_{\mathcal{X}}$, let $\{m_i\}_{i=1}^{B+1}$
denote a set of non-negative numbers, and $K$, $n$, $I_{\max}$ be
arbitrary positive constants satisfying $n \ge e$ and $K \ge 2 I_{\max}$.
Define the target rate

$$R_T = \max_Q \sum_{i=1}^{B+1} \frac{m_i}{n} F_i(Q). \tag{37}$$

Define the actual rate $R$ over $n$ channel uses as:

$$R = \frac{KB}{n}. \tag{38}$$

Define the sequential predictor $Q_i$ as the result of (35) and
(36). Let $\{m_i\}_{i=1}^{B+1}$ satisfy:

$$m_i F_i(Q_i) \le K. \tag{39}$$

Then for the value of $\eta$ specified below (43) it is guaranteed
that:

$$R \ge \min(R_T, I_{\max}) - \Delta_{\text{pred}},$$



where 



and 



K 



Ci 



ln{n) 



Ci =2V^-|A'|(|A'|-l)-/ma 

The value of rj attaining the result above is: 



\X\-1 



V = 



ln(n) • A 



(40) 

(41) 
(42) 

(43) 



The lemma is proven in Appendix B. The proof uses similar
techniques to those introduced in Section III-E; however,
differently from the previous analysis, due to mixing with the
uniform prior, the "Blackwell" condition ((23) in the previous
case) only approximately holds. On the other hand, the use
of the uniform prior enables relating $F_i(Q_i)$ to $F_i(Q)$ for any
other $Q$, and thus obtaining from (39) an upper bound on the
gain $m_i F_i(Q)$ related to an alternative prior $Q$. The trade-off
between the two is expressed in the two last factors in (41),
one of which is increasing with $\lambda$ and the other decreasing.

Since by (39), $R \ge \sum_{i=1}^{B+1} \frac{m_i}{n} F_i(Q_i) - \frac{K}{n}$, the claim of the
lemma appears similar to Theorem 1 with $m_i F_i(Q)$ taking
the place of the function $I(Q, W_i)$. However, two important
properties of the lemma, distinguishing it from the rather
standard claim of Theorem 1, are that the bound does not
depend on the number of blocks (i.e. the number of prediction
steps), and that no upper bound on $F_i(Q)$ is assumed.

The rate $I_{\max}$ represents a bound on mutual information,
but in the context of the lemma it is enough to consider it as
an arbitrary rate that caps $R_T$. It affects the setting of $\eta$ and
the resulting loss. Also, $n$ does not have to correspond to the
actual number of symbols and serves here merely as a scaling
parameter for the communication rate. The lemma sets a value
of $\eta$ but not for $\lambda$, since $\lambda$ will have additional roles in the
next section.

D. Motivation for the prediction algorithm 

In this section a motivation for the prediction algorithm, 
and especially for the use of the uniform prior is given. Under 
abstract assumptions it is shown to achieve the capacity of 
the averaged channel. This section is intended merely to give 
motivation and is not formally necessary for the proof of 
Theorem 3.

To simplify the discussion, let us make abstract assumptions
regarding the decoding condition and the channel estimation:

1) The decoding condition yields block lengths satisfying:

$$m_i \le \frac{K}{I(Q_i, \overline{W}_i)} \tag{44}$$

with an equality for all blocks except the last one which
is not decoded. This implies the rate $\frac{K}{m_i}$ equals the
mutual information of the averaged channel.

2) The averaged channels over all previous blocks are
known and available for the predictor.

With these assumptions, the prediction problem can be considered
separately from decoding and channel estimation issues.
Supposing that $B$ blocks were transmitted, the achieved rate is
$R = \frac{KB}{n}$. Since $n \approx \sum_{i=1}^{B} m_i$, using (44) this can be written as
$R \approx \left(\frac{1}{B} \sum_{i=1}^{B} I(Q_i, \overline{W}_i)^{-1}\right)^{-1}$. The target is to find a prediction
scheme for $Q_i$, such that for any sequence $W_i$, one will have
$R \ge C(\overline{W}) - \delta_n$ with $\delta_n \to 0$. There are two main difficulties
compared to the prediction problem discussed in Section III:

1) The problem is not directly posed as a prediction problem
with an additive loss.

2) The loss is not bounded: if for some $i$, $I(Q_i, \overline{W}_i) = 0$
then the rate becomes zero regardless of other blocks.
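The harmonic-mean form of the rate, and the way a single zero-information block destroys it, can be seen in a short sketch. It assumes equality in (44) for every block and ignores the undecoded last block; the function name `abstract_rate` is hypothetical.

```python
def abstract_rate(K, mutual_infos):
    """Rate K*B/n when block i has length m_i = K / I_i (equality in (44)).

    mutual_infos[i] stands for I(Q_i, W_i) of block i. A block with zero
    mutual information never terminates, so the rate is taken to be zero.
    """
    if any(i == 0 for i in mutual_infos):
        return 0.0
    lengths = [K / i for i in mutual_infos]   # block lengths m_i
    n = sum(lengths)                          # total number of symbols
    return K * len(mutual_infos) / n          # harmonic mean of the I_i


rate = abstract_rate(10, [0.5, 0.25])
```

With per-block mutual informations 0.5 and 0.25 the two blocks last $2K$ and $4K$ symbols, so the rate is $2K / 6K = 1/3$, the harmonic mean of the two values rather than their arithmetic mean.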

The first issue is resolved by posing an alternative problem
which has an additive loss, and using the convexity of the
mutual information with respect to the channel (as will be
exemplified below in the abstract case). Regarding the second
issue, notice that if the channel has zero capacity (always,
or from some point in time onward), it is possible that one
of the blocks will extend forever and will never be decoded.
However, we must avoid a situation where the channel has
non-zero capacity (which our competition enjoys), while a
badly chosen prior yields $I(Q_i, \overline{W}_i) = 0$. This may happen for
example in the channels of Example 1 if the predictor selects
to use the pair of inputs that yield zero capacity. If this happens
then the scheme will get stuck since the block will never be
decoded, and hence there will be no chance to update the
prior. In addition, notice that selecting some inputs with zero
probability makes the predictor blind to the channel values
over these inputs. To resolve these difficulties we construct
the predictor as a mixture between an exponentially weighted
predictor and a uniform prior. We use a result by Shulman and
Feder [19], which bounds the loss from capacity by using the
uniform prior $U$:

$$I(U; W) \ge C - \beta(C) \ge \frac{C}{|\mathcal{X}| \cdot (1 - e^{-1})}, \tag{45}$$

where $C$ is the channel capacity and $\beta(C)$ is defined therein.
This guarantees that if the capacity is non-zero, then the
uniform prior will yield a non-zero rate, and hence the block
will not last indefinitely.
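As a sanity check of the outer inequality in (45), the sketch below compares the mutual information under the uniform prior with the capacity of one particular channel. The choice of a Z-channel with crossover 0.5, the use of nats, and the grid-search capacity are assumptions of this example; the function names are hypothetical.

```python
import math


def mutual_information(q, W):
    """I(Q, W) in nats for a prior q[x] and a channel W[x][y]."""
    ny = len(W[0])
    py = [sum(q[x] * W[x][y] for x in range(len(q))) for y in range(ny)]
    total = 0.0
    for x, qx in enumerate(q):
        for y in range(ny):
            if qx > 0 and W[x][y] > 0:
                total += qx * W[x][y] * math.log(W[x][y] / py[y])
    return total


def capacity_binary_input(W, steps=2000):
    """Grid-search approximation of max_Q I(Q, W) for a binary input."""
    return max(mutual_information([g / steps, 1.0 - g / steps], W)
               for g in range(steps + 1))


z_channel = [[1.0, 0.0], [0.5, 0.5]]          # assumed example channel
i_uniform = mutual_information([0.5, 0.5], z_channel)
cap = capacity_binary_input(z_channel)
lower_bound = cap / (2 * (1.0 - math.exp(-1.0)))   # right-hand side of (45), |X| = 2
```

For this channel the uniform prior loses only a small part of the capacity, and `i_uniform` stays above `lower_bound`, so a block driven by the uniform component alone would still terminate.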

Under the abstract assumptions made here, the following $F_i$
is known and can be substituted in Lemma 4:

$$F_i(Q) \triangleq I(Q, \overline{W}_i). \tag{46}$$

This yields the following result:



Lemma 5. For the scheme of Section IV-B under the abstraction
specified above, with $n \ge 3$ and $K \ge 2 I_{\max}$ and properly
chosen $\eta, \lambda$, the following holds: for any sequence of channels,
the rate satisfies:

$$R = \frac{KB}{n} \ge C(\overline{W}) - \Delta, \tag{47}$$

where $C(\overline{W})$ is the capacity of the averaged channel and

1 

'ln(n 



2 

^max 



2 1 
\X\3 ■ K3 



0, (48) 



where $I_{\max} = \log \min(|\mathcal{X}|, |\mathcal{Y}|)$. The parameters of the
scheme $\eta, \lambda$ required to attain the result are specified in (43)
and \\9\) respectively. 

Note that the bound (48) is increasing with $K$, so it appears
that it can be improved by taking the minimal value of
$K$. However, in the actual system, there are fixed overheads
related to the communication scheme, and a large block size
would be needed to overcome them. Taking any fixed and large
enough K, the normalized regret is bounded by 
which converges to zero, but at a worse rate than we had in 
Section III-D.

Note that the claims of Lemma 5 are stronger than the
claims that appeared in the conference paper on the subject
[10], for the same problem, mainly in terms of the improved
convergence rate with $n$. Also, the scheme used here is slightly
different than the one in the conference paper (in Equation (36)).
The proof corresponding to the scheme presented in
the conference paper can be found in an early version uploaded
to arXiv [20].

To prove Lemma 5, Lemma 4 is used with $F_i$ defined
in (46). The rate guaranteed by Lemma 4 is approximately
$R_T \ge \sum_{i=1}^{B+1} \frac{m_i}{n} I(Q, \overline{W}_i)$. Using convexity of the mutual
information with respect to the channel this is at least
$I\left(Q, \sum_{i=1}^{B+1} \frac{m_i}{n} \overline{W}_i\right) = I(Q, \overline{W})$, and since this is true for
any $Q$, the rate is at least $C(\overline{W})$. The detailed proof appears
in Appendix G.

V. Proof of the main result 

In this section we prove Theorem 3 regarding the attainability
of $C(\overline{W})$. The principles of the prediction scheme have
been laid in the previous section, and here we plug in a suitable
decoding condition and a channel estimator.

A. Preliminaries 

Suppose that during a certain block of length $m$ we have
used the i.i.d. prior $Q(x)$. In order to estimate the channel after
the block has ended and $\mathbf{x}$ was decoded, we use the following
estimate:

$$\check{W}(y|x) = \frac{\hat{P}_{\mathbf{x},\mathbf{y}}(x, y)}{Q(x)}, \tag{49}$$

where here and throughout the current section, $\mathbf{x}, \mathbf{y}$ denote
the $m$-length input and output vectors over the block, and
$\hat{P}_{\mathbf{x},\mathbf{y}}(x, y)$ is the empirical distribution of the pair $(x_i, y_i)$ (for
$i = 1, \ldots, m$). The estimator is the joint empirical distribution
divided by the (known) marginal distribution of the input
$X$. Since we mix a uniform prior into $Q(x)$ (35), all $Q(x)$
are bounded away from zero, which makes the estimator
(49) statistically stable, in comparison with the more natural
estimator given by the empirical conditional distribution:

$$\hat{W}(y|x) = \hat{P}_{\mathbf{y}|\mathbf{x}}(y|x) = \frac{\hat{P}_{\mathbf{x},\mathbf{y}}(x, y)}{\hat{P}_{\mathbf{x}}(x)}, \tag{50}$$

in which the denominator may turn out to be zero. A drawback
of the proposed estimator (49) is that it does not generally
yield a legitimate probability distribution, i.e. $\sum_y \check{W}(y|x) \ne 1$.
The result of using this estimator is that in the calculations,
we will see values that formally appear like probabilities
but are not. To distinguish them from legitimate probabilities
we term these values "false" probabilities, and mark them
with a check accent ($\check{\ }$). These functions usually approximate or estimate
a legitimate probability. Formally, a false probability $\check{p}(y)$
or $\check{p}(y|x)$ can be any non-negative function of $y$ or $x, y$
(respectively). Note that until this point we did not need the
assumption that the output alphabet $\mathcal{Y}$ is finite, since the
channel was given to the predictor rather than being estimated,
and this is the first time this assumption is used.
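The estimator (49) is simple to compute, and a tiny example shows why its rows need not sum to one. The sketch below assumes integer-labelled alphabets and a strictly positive prior; the name `false_channel` is hypothetical.

```python
from collections import Counter


def false_channel(x, y, Q, num_outputs):
    """Joint empirical distribution of (x_i, y_i) divided by the known prior Q.

    x, y are equal-length sequences of integer symbols; Q[a] > 0 for all a.
    Returns W[a][b], which is non-negative but generally not row-stochastic.
    """
    m = len(x)
    counts = Counter(zip(x, y))
    return [[counts[(a, b)] / (m * Q[a]) for b in range(num_outputs)]
            for a in range(len(Q))]


W = false_channel([0, 0, 1, 0], [0, 1, 1, 0], [0.5, 0.5], 2)
row_sums = [sum(row) for row in W]
```

Here the input symbol 0 happened to be drawn three times out of four although its prior probability is 0.5, so the first row sums to 1.5 and the second to 0.5. Dividing by the empirical input frequency instead would restore unit row sums, at the cost of a possibly vanishing denominator.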

The function that we use as an optimization target for
selecting the prior for the next block is, as before, the mutual
information. The reason is that since our aim is to achieve the
capacity of the averaged channels, the "competing" schemes,
for each prior $Q$, achieve the mutual information of the averaged
channel. However, since the estimates of past channels
are false probabilities, we need to define how to apply the
mutual information to them. We do this by simply plugging
the false channel into the standard formula of $I(Q, W)$.
This substitution results in what we define as the false mutual
information $\check{I}(Q, \check{W})$:

$$\check{I}(Q, \check{W}) \triangleq \sum_{x,y} Q(x) \check{W}(y|x) \log \frac{\check{W}(y|x)}{\sum_{x'} Q(x') \check{W}(y|x')}, \tag{51}$$

where cases of $Q(x) = 0$ or $\check{W}(y|x) = 0$ are resolved using the
convention $0 \cdot \log 0 = 0$. The following lemma shows that
most of the properties of the mutual information function
$I(P, W)$ needed for our previous analysis in Section IV-C are
maintained.

Lemma 6 (Properties of false mutual information). The function
$\check{I}(Q, \check{W})$ defined in (51) is

1) Non-negative

2) Concave with respect to $Q$

3) Convex with respect to $\check{W}$

4) Upper bounded by $a \cdot \log|\mathcal{X}|$, where
$a = \max_x \sum_y \check{W}(y|x)$.

The proof is technical and appears in Appendix C. In
addition to the properties above, our proof relies on the next
property which is more surprising. When the prior $Q$ used
for estimating the channel in (49) is the same prior $Q$ used
as input in (51), the false mutual information attains a form
which is familiar from [21] as a prototype of the zero order
rate function. As in [21], we use this form to obtain a bound on
the probability of $\check{I}(Q, \check{W})$ to exceed a threshold for a random
drawing of $\mathbf{x}$. This bound, in turn, allows us to construct the
rate-adaptive system attaining a block length $m_i$ that depends
on $\check{I}(Q, \check{W})$.

Following [21], we define the conditional empirical probability
of the discrete sequence $\mathbf{x}$ given the sequence $\mathbf{y}$ as
$\hat{p}(\mathbf{x}|\mathbf{y}) \triangleq \prod_{i=1}^{m} \hat{P}_{\mathbf{x}|\mathbf{y}}(x_i|y_i)$, i.e. the probability of the sequence $\mathbf{x}$ under
the conditionally i.i.d. distribution $P(x|y) = \hat{P}_{\mathbf{x}|\mathbf{y}}(x|y)$. Also,
when vectors are substituted into $Q$ we explicitly extend $Q$
in an i.i.d. fashion, i.e. $Q(\mathbf{x}) = \prod_{i=1}^{m} Q(x_i)$. We will use the
following result:

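Lemma 7 (False mutual information as a decoding metric).
The false MI with prior $Q(x)$ and $\check{W}(y|x) = \frac{\hat{P}_{\mathbf{x},\mathbf{y}}(x, y)}{Q(x)}$, where
$\mathbf{x}, \mathbf{y}$ are $m$-length vectors, can be written as:

$$\check{I}(Q, \check{W}) = \check{I}\left(Q(x), \frac{\hat{P}_{\mathbf{x},\mathbf{y}}(x, y)}{Q(x)}\right) = \frac{1}{m} \log \frac{\hat{p}(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}. \tag{52}$$

Furthermore, for any $Q$ and any $\mathbf{y}$, when $\mathbf{X}$ is distributed i.i.d. $Q$:

$$\Pr\left(\check{I}(Q, \check{W}) \ge T \,\big|\, \mathbf{y}\right) = \Pr\left(\frac{\hat{p}(\mathbf{X}|\mathbf{y})}{Q(\mathbf{X})} \ge \exp(mT) \,\Big|\, \mathbf{y}\right) \le \exp\left(-(mT - k_0 \log m - k_1)\right), \tag{53}$$

where

$$k_0 = k_1 = (|\mathcal{X}| - 1) \cdot |\mathcal{Y}|. \tag{54}$$

Note that from the results in [21, Theorem 9]¹ (by using
the result of the Theorem and the definition of intrinsic
redundancy therein) we can obtain a tighter upper bound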
with ko — l-'^!''^!^!"-'-) log m (ko log m + ki = r,n where 
$r_m$ is explicitly stated in [21, Theorem 9]). For the sake of
simplicity we prove here the looser result above, as this does
not change the asymptotic results significantly.

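Proof of Lemma 7: The first part is shown by direct
substitution. When $\check{W} = \frac{\hat{P}_{\mathbf{x},\mathbf{y}}(x, y)}{Q(x)}$ we have

$$\sum_{x'} Q(x') \check{W}(y|x') = \sum_{x'} \hat{P}_{\mathbf{x},\mathbf{y}}(x', y) = \hat{P}_{\mathbf{y}}(y). \tag{55}$$

Therefore

$$\check{I}(Q, \check{W}) = \sum_{x,y} Q(x) \frac{\hat{P}_{\mathbf{x},\mathbf{y}}(x, y)}{Q(x)} \log \frac{\hat{P}_{\mathbf{x},\mathbf{y}}(x, y)}{Q(x) \hat{P}_{\mathbf{y}}(y)} = \sum_{x,y} \hat{P}_{\mathbf{x},\mathbf{y}}(x, y) \log \frac{\hat{P}_{\mathbf{x}|\mathbf{y}}(x|y)}{Q(x)} = \frac{1}{m} \sum_{i=1}^{m} \log \frac{\hat{P}_{\mathbf{x}|\mathbf{y}}(x_i|y_i)}{Q(x_i)} = \frac{1}{m} \log \frac{\hat{p}(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})}. \tag{56}$$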

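As for the second claim, by Markov bound we have:

$$\Pr\left(\frac{\hat{p}(\mathbf{X}|\mathbf{y})}{Q(\mathbf{X})} \ge \exp(mT) \,\Big|\, \mathbf{y}\right) \le \exp(-mT)\, \mathbb{E}\left[\frac{\hat{p}(\mathbf{X}|\mathbf{y})}{Q(\mathbf{X})} \,\Big|\, \mathbf{y}\right] \overset{(a)}{=} \exp(-mT) \sum_{\mathbf{x}} Q(\mathbf{x}) \frac{\hat{p}(\mathbf{x}|\mathbf{y})}{Q(\mathbf{x})} = \exp(-mT) \sum_{\mathbf{x}} \hat{p}(\mathbf{x}|\mathbf{y}), \tag{57}$$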



where in (a) we have used the fact that $\mathbf{X}$ is distributed $Q$
independently of $\mathbf{y}$. To bound the sum above we split the set of
sequences $\mathbf{x}$ into subsets having the same conditional empirical
probability $\hat{P}_{\mathbf{x}|\mathbf{y}}(x|y)$ (i.e. same conditional type [22], [23, §11]).
In a subset having $\hat{P}_{\mathbf{x}|\mathbf{y}}(x|y) = p(x|y)$, the empirical
probability $\hat{p}(\mathbf{x}|\mathbf{y}) = \prod_i p(x_i|y_i)$ equals the (legitimate) probability
of the sequence under the i.i.d. distribution $p$, and as a
result we have $\sum_{\mathbf{x} : \hat{P}_{\mathbf{x}|\mathbf{y}} = p} \hat{p}(\mathbf{x}|\mathbf{y}) \le 1$. The number of
subsets is upper bounded (similarly to bounds on the number
of types [23, Theorem 11.1.1]) by $(m + 1)^{(|\mathcal{X}| - 1) \cdot |\mathcal{Y}|}$, since each
conditional type is completely defined by $(|\mathcal{X}| - 1) \cdot |\mathcal{Y}|$ integers
in $\{0, \ldots, m\}$. Therefore

$$\sum_{\mathbf{x}} \hat{p}(\mathbf{x}|\mathbf{y}) = \sum_{p} \sum_{\mathbf{x} : \hat{P}_{\mathbf{x}|\mathbf{y}}(x|y) = p(x|y)} \hat{p}(\mathbf{x}|\mathbf{y}) \le (m + 1)^{(|\mathcal{X}| - 1) \cdot |\mathcal{Y}|}. \tag{58}$$

Substituting in (57) and using $m + 1 \le 2m$ yields the desired
result.

¹Reference is to be updated in the final revision.
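The identity in the first part of Lemma 7 can be verified numerically on any pair of sequences. The sketch below computes the false mutual information twice: once by plugging the estimator (49) into the definition (51), and once through the per-symbol log-ratio form. Natural logarithms and integer-labelled symbols are assumptions of the example, and both function names are hypothetical.

```python
import math
from collections import Counter


def false_mi_direct(x, y, Q):
    """Definition (51) evaluated at W(y|x) = P_xy(x, y) / Q(x)."""
    m = len(x)
    joint = Counter(zip(x, y))
    outputs = set(y)

    def w(a, b):
        return joint[(a, b)] / (m * Q[a])

    total = 0.0
    for (a, b), count in joint.items():       # only terms with W > 0 contribute
        denom = sum(Q[a2] * w(a2, b) for a2 in range(len(Q)))
        total += Q[a] * w(a, b) * math.log(w(a, b) / denom)
    assert outputs                            # guards against empty input
    return total


def false_mi_metric(x, y, Q):
    """(1/m) * log( p_hat(x|y) / Q(x) ), computed symbol by symbol."""
    m = len(x)
    joint = Counter(zip(x, y))
    py = Counter(y)
    log_ratio = sum(math.log((joint[(a, b)] / py[b]) / Q[a])
                    for a, b in zip(x, y))
    return log_ratio / m


x_seq = [0, 1, 1, 0, 1]
y_seq = [0, 1, 0, 0, 1]
prior = [0.3, 0.7]
direct = false_mi_direct(x_seq, y_seq, prior)
metric = false_mi_metric(x_seq, y_seq, prior)
```

The two values coincide up to floating-point error. The second form is the one a receiver can evaluate per candidate codeword, since it only needs the codeword, the channel output and the known prior.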



B. Decoding condition and estimated channel 

When the communication scheme was described in Section IV-B,
the details of the decoding condition and channel
estimation were omitted. These are specified below. At
each symbol of the block and for each codeword $\mathbf{x}_l$,
$l = 1, \ldots, \exp(K)$ in the codebook, the receiver evaluates the
following decoding condition:

$$\frac{\hat{p}(\mathbf{x}_l|\mathbf{y})}{Q_i(\mathbf{x}_l)} \ge \exp(\beta K), \tag{59}$$

where $\beta$ is a parameter to be specified later on, and the vectors
$\mathbf{x}_l$ and $\mathbf{y}$ are taken over the symbols of the block.

Equivalently, by Lemma 7 the decoding condition can be
written as:

$$m \cdot \check{I}(Q_i, \check{W}) \ge \beta K, \tag{60}$$

where $m$ is the number of the symbol in the block and $\check{W}(y|x)$
is a channel estimate according to (49), where $\mathbf{x}$ is substituted
with the hypothesized input $\mathbf{x}_l$ and $\mathbf{y}$ is the known output vector
over the block.

After decoding, the receiver sets the estimated channel $\check{W}_i$
as the false channel $\check{W}(y|x)$ measured according to (49),
where $\mathbf{x}, \mathbf{y}$ are the $m$-length vectors denoting the (hypothesized)
input and output vectors over the duration of the block.

To produce the next prior, this false channel is fed into the
prediction scheme of Lemma 4 with $F_i(Q) = \check{I}(Q, \check{W}_i)$ and
where $m_i$ denotes the length of block $i$. The parameters $\beta, \eta, \lambda$
(the latter are required for the prediction scheme of Lemma 4)
will be determined in the course of the proof.

C. Proof outline 

The following proof outline conveys the main ideas in
the proof, while some details were intentionally dropped, for
simplicity.

1) Using the results of Lemma 7 we show that the block
lengths can satisfy the inequality (39) required by
Lemma 4 up to a small overhead term in $K$, while
still attaining a small probability of error.

2) Operating the prior prediction scheme of Lemma 4
with $F_i(Q) = \check{I}(Q, \check{W}_i)$ as the metric with
the measured channels, guarantees that if no errors
were made, the rate achieved by the system exceeds
$\max_Q \sum_{i=1}^{B+1} \frac{m_i}{n} \check{I}(Q, \check{W}_i)$ up to vanishing factors,
where $B$ is the number of blocks that were sent.

3) Due to the convexity of the false mutual information
with respect to the channel, the rate above exceeds
$\max_Q \check{I}(Q, \check{W}_A)$ where $\check{W}_A = \sum_{i=1}^{B+1} \frac{m_i}{n} \check{W}_i$.

4) Since the rate above exceeds $\check{I}(Q, \check{W}_A)$ for any $Q$, it
exceeds $\check{C}(\check{W}_A) = \max_Q \check{I}(Q, \check{W}_A)$.

5) All that is left is to show the convergence in probability of
$\check{W}_A$ to the true average channel $\overline{W}$, and by using the
continuity of the capacity this proves the convergence
in probability of $\check{C}(\check{W}_A)$ to the capacity of the averaged
channel $C(\overline{W})$.

6) In order to attain explicit bounds on the convergence rate
we develop bounds relating the difference in capacity to
the difference in the channels, and optimize the system
parameters.

Note that there are several delicate issues caused by the
relations between $\check{W}_i$, $m_i$ and $Q_i$. For example, the correct
operation of the prior predictor relies on the assumption
of correct decoding which is required to obtain the correct
channel estimators (i.e. that $\mathbf{x}$ used in (49) is the true channel
input). However, conditioning on the event of correct decoding
changes the distribution of the average estimated channel
$\check{W}_A$. Another example is that, although the convergence of
$\sum_{i} \frac{m_i}{n} \check{W}_i$ to $\overline{W}$ appears to be trivial at first sight, the proof
is complicated by the fact that the block lengths $m_i$ are random
variables, which themselves depend on the estimated channels
$\check{W}_i$. One embodiment of this dependence is that the block
would never end with an estimated channel which has zero
capacity. Another dependence is between $m_i, \check{W}_i$ of different
blocks, created through the prior prediction $Q_i$.

We start with a set of definitions and propositions formalizing
the claims made in the proof outline above. We use $k$
to denote the symbol index and $i$ to denote the block index.
We denote by $i = b_k$ the block index of a certain symbol
(i.e. $i = b_k$ if symbol $k$ belongs to block $i$). We define $m_i$
($i = 1, \ldots, B + 1$) as the length of each block including the
last one. The last block is not accounted for in the rate, even
if it is decoded.

D. Error probability 

Proposition 1 (Error probability and decoding thresholds). 
For the value of /? given below ( |63| l, the probability of any 
decoding error occurring in any of the blocks is at most e. 

Proof: Consider a specific block and denote by $m$ the number of
the symbol inside the block. Since codewords other than the one
which is actually transmitted are independent of $\mathbf{X}, \mathbf{y}$,
the probability to decide in favor of a specific erroneous
codeword $\mathbf{X}_l$ at any specific symbol $k$ (i.e. that (59) will
hold with respect to it) is upper bounded using (53) by:

$$P_{\mathrm{err}}^{(l,k)} = \Pr\left\{\text{(59) holds for } \mathbf{X}_l \text{ at symbol } k \,\middle|\, \mathbf{y}\right\} \le \exp\left(-(mT - k_0 \log m - k_1)\right)\Big|_{T = \beta K / m} = \exp\left(-(\beta K - k_0 \log m - k_1)\right), \tag{61}$$

where $k_0, k_1$ are defined in Lemma 7. By taking the expected
value over $\mathbf{Y}$, the same bound holds when not conditioning
on $\mathbf{y}$. Since there are $\exp(K) - 1$ competing codewords
and $n$ symbols, the probability to decide in favor of any
erroneous codeword at any symbol (i.e. to make any decoding
error) is upper bounded, using the union bound, by:

$$P_{\mathrm{err}} \le \exp(K) \cdot n \cdot \exp\left(-(\beta K - k_0 \log n - k_1)\right) = \exp\left(-\left((\beta - 1)K - (k_0 + 1)\log n - k_1\right)\right), \tag{62}$$

where we replaced $\log m$ by $\log n \ge \log m$. We now determine
$\beta$ so as to make the RHS equal $\epsilon$, and thereby guarantee
that the error probability is at most $\epsilon$:

$$\beta = 1 + \frac{\log(\epsilon^{-1}) + (k_0 + 1)\log n + k_1}{K}. \tag{63}$$

Note that with a suitable choice of $K$ we would have
$\beta \to 1^{+}$ as $n \to \infty$. □
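As an illustration of how the threshold in (63) is tied to the union bound (62), the following minimal sketch evaluates both expressions. The constants `k0` and `k1` come from Lemma 7, which is not reproduced here, so the values passed below are placeholders chosen only for the demonstration; natural logarithms are assumed, with $K$ measured in nats.

```python
import math


def decoding_threshold_beta(K, n, eps, k0, k1):
    """Threshold beta of (63); k0, k1 are the Lemma 7 constants (assumed given)."""
    return 1.0 + (math.log(1.0 / eps) + (k0 + 1.0) * math.log(n) + k1) / K


def union_bound_error(K, n, beta, k0, k1):
    """Right-hand side of (62) for a given beta."""
    return math.exp(-((beta - 1.0) * K - (k0 + 1.0) * math.log(n) - k1))


if __name__ == "__main__":
    # Placeholder values, not taken from the paper.
    K, n, eps, k0, k1 = 200.0, 10**6, 1e-3, 2.0, 1.0
    beta = decoding_threshold_beta(K, n, eps, k0, k1)
    print("beta =", beta)
    print("bound (62) =", union_bound_error(K, n, beta, k0, k1))
```

By construction, substituting the $\beta$ of (63) into (62) returns $\epsilon$, and increasing $K$ for fixed $n$ pushes $\beta$ toward 1.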



E. Attained rate 

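The following proposition relates the rate to the averaged
estimated channel $\hat{W}_A$.

Proposition 2 (Rate as a function of average estimated channel).
If there are no decoding errors, the rate of the scheme satisfies:

$$R = \frac{KB}{n} \ge (1 - \delta_1) \cdot \min\left(C(\hat{W}_A), I_{\max}\right) - \Delta_{\mathrm{pred}}, \tag{64}$$

where $C(\hat{W}) = \max_{Q \in \Delta_{\mathcal{X}}} I(Q, \hat{W})$ is the false capacity,
$\hat{W}_A$ is the averaged estimated channel

$$\hat{W}_A(y|x) = \frac{1}{n} \sum_{i=1}^{B+1} m_i \hat{W}_i(y|x), \tag{65}$$

$\Delta_{\mathrm{pred}}$ is defined in Lemma 4 (for the relevant parameters
$n, K, \lambda$), and

$$\delta_1 = \frac{1}{K}\left[\log(\epsilon^{-1}) + (k_0 + 1)\log n + k_1 + \log\frac{|\mathcal{X}|}{\lambda}\right]. \tag{66}$$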

Proof: Denote by $\hat{W}_l^{(i)}(y|x)$ the channel estimate according
to (49), taken over the symbols of the $i$-th block, with respect
to the hypothesized input sequence $\mathbf{X}_l$. By our definition of
$\hat{W}_i$ (Section V-B), $\hat{W}_i = \hat{W}_l^{(i)}(y|x)$ when $l$ is the index of
the correct codeword. Denote by $\hat{W}_i^*$ the value of $\hat{W}_l^{(i)}(y|x)$
when $l$ is the index of the hypothesized codeword. When there
are no errors, $\hat{W}_i^* = \hat{W}_i$.

We use the prediction scheme of Lemma 4 with
$F_i(Q) = I(Q, \hat{W}_i^*)$. By Lemma 6 this choice satisfies the conditions
of the lemma with respect to $F_i(Q)$. Assuming there are no
errors, we can equivalently write $F_i(Q) = I(Q, \hat{W}_i)$.

We now use the decoding condition to show that the requirements
of Lemma 4 with respect to the block length (39) hold.

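Denote by $\hat{W}_i^{(a)}$ and $\hat{W}_i^{(b)}$ the channel estimates taken
with respect to the true $\mathbf{x}$ over the first $m_i - 1$ symbols of the
block $i$, and over the last symbol of the block, respectively. In
other words, if block $i$ spans symbols $[k_i, l_i]$ where
$l_i - k_i + 1 = m_i$ then

$$\hat{W}_i^{(a)}(y|x) = \frac{\sum_{k=k_i}^{l_i - 1} \mathrm{Ind}(X_k = x, Y_k = y)}{(m_i - 1)\, Q_i(x)}, \tag{67}$$

$$\hat{W}_i^{(b)}(y|x) = \frac{\mathrm{Ind}(X_{l_i} = x, Y_{l_i} = y)}{Q_i(x)},$$

where in the equations above we wrote the empirical distribution
in (49) explicitly as a normalized sum of indicator functions. We
currently assume $m_i > 1$ and we'll return to the case of
$m_i = 1$ at the end. From the above we have that:

$$\hat{W}_i(y|x) = \frac{m_i - 1}{m_i} \hat{W}_i^{(a)}(y|x) + \frac{1}{m_i} \hat{W}_i^{(b)}(y|x). \tag{70}$$

Since at symbol $m_i - 1$ in the block, which is one symbol before
decoding, none of the codewords satisfies the decoding condition,
including the correct codeword (which corresponds to the true
channel input $\mathbf{X}$), we have

$$(m_i - 1) \cdot I\left(Q_i, \hat{W}_i^{(a)}\right) < \beta K. \tag{71}$$

The same holds for the last block $i = B + 1$. As for $\hat{W}_i^{(b)}$,
from (35) we have that

$$Q_i(x) \ge \frac{\lambda}{|\mathcal{X}|}, \tag{72}$$

and because $\hat{W}_i^{(b)}$ is measured on a single symbol, we can bound:

$$I\left(Q_i, \hat{W}_i^{(b)}\right) = \log\frac{1}{Q_i(X_{l_i})} \le \log\frac{|\mathcal{X}|}{\lambda}. \tag{73}$$

The equality above can be obtained using Lemma 7 or by definition,
using the fact that only for a single pair $(x, y)$,
$\hat{W}_i^{(b)}(y|x) > 0$. Combining (71) and (73) using (70) we have:

$$m_i \cdot I\left(Q_i, \hat{W}_i\right) = m_i \cdot I\left(Q_i, \frac{m_i - 1}{m_i}\hat{W}_i^{(a)} + \frac{1}{m_i}\hat{W}_i^{(b)}\right) \le (m_i - 1) \cdot I\left(Q_i, \hat{W}_i^{(a)}\right) + 1 \cdot I\left(Q_i, \hat{W}_i^{(b)}\right) \le \beta K + \log\frac{|\mathcal{X}|}{\lambda} \triangleq \tilde{K}. \tag{74}$$

In the case of $m_i = 1$, $\hat{W}_i = \hat{W}_i^{(b)}$ and (74) holds due
to (73). The last inequality means the conditions of Lemma 4
with respect to $m_i$ are satisfied, with $K$ replaced by $\tilde{K}$. Under
the conditions of the lemma, it guarantees that:

$$\tilde{R} = \frac{\tilde{K} B}{n} \ge \min\left(\max_Q \sum_{i=1}^{B+1} \frac{m_i}{n} \cdot I(Q, \hat{W}_i),\ I_{\max}\right) - \Delta_{\mathrm{pred}}, \tag{75}$$

where $\Delta_{\mathrm{pred}} = \Delta_{\mathrm{pred}}(\tilde{K})$ is the offset defined in the lemma,
with $K$ replaced by $\tilde{K}$. We use the convexity of $I$ with respect
to the channel (Lemma 6) in order to relate the sum above to
the capacity of the estimated averaged channel $\hat{W}_A$:

$$\sum_{i=1}^{B+1} \frac{m_i}{n} \cdot I(Q, \hat{W}_i) \ge I\left(Q, \sum_{i=1}^{B+1} \frac{m_i}{n} \hat{W}_i\right) = I(Q, \hat{W}_A). \tag{76}$$

Substituting in (75) we obtain:

$$\tilde{R} \ge \min\left(\max_Q I(Q, \hat{W}_A),\ I_{\max}\right) - \Delta_{\mathrm{pred}} = \min\left(C(\hat{W}_A),\ I_{\max}\right) - \Delta_{\mathrm{pred}}. \tag{77}$$

Because the actual rate that the scheme achieves is not $\tilde{R}$ but
$R = \frac{KB}{n}$ we have:

$$R = \frac{K}{\tilde{K}} \tilde{R} \ge \frac{K}{\tilde{K}} \cdot \min\left(C(\hat{W}_A),\ I_{\max}\right) - \frac{K}{\tilde{K}} \Delta_{\mathrm{pred}}(\tilde{K}). \tag{78}$$

Considering the second term, notice that the expression for
$\Delta_{\mathrm{pred}}(K)$ in Lemma 4 is sublinear in $K$, i.e. $\frac{1}{K}\Delta_{\mathrm{pred}}(K)$ is
decreasing with $K$, and therefore
$\frac{1}{\tilde{K}}\Delta_{\mathrm{pred}}(\tilde{K}) \le \frac{1}{K}\Delta_{\mathrm{pred}}(K)$,
and we can replace the offset term in (78) by $\Delta_{\mathrm{pred}}(K)$.
As for the factor $\frac{K}{\tilde{K}}$ we have

$$\frac{\tilde{K}}{K} = \beta + \frac{1}{K}\log\frac{|\mathcal{X}|}{\lambda} = 1 + \frac{1}{K}\left[\log(\epsilon^{-1}) + (k_0 + 1)\log n + k_1 + \log\frac{|\mathcal{X}|}{\lambda}\right] = 1 + \delta_1, \tag{79}$$

and by using $\frac{K}{\tilde{K}} = \frac{1}{1 + \delta_1} \ge 1 - \delta_1$ we have the desired result. □

F. Channel convergence

We would now like to show the convergence of $\hat{W}_A$ to $\overline{W}$.
As mentioned above, $m_i$ and $\hat{W}_i$ are statistically dependent. To
avoid conditioning on $m_i$, we first write $\hat{W}_A$ in an alternative
form. Plugging the explicit form of $\hat{W}$ from (49) into the
definition of $\hat{W}_A$ (65), we have:

$$\hat{W}_A = \frac{1}{n}\sum_{i=1}^{B+1} m_i \hat{W}_i(y|x) = \frac{1}{n}\sum_{k=1}^{n} \frac{\mathrm{Ind}(X_k = x, Y_k = y)}{Q_{b_k}(x)}. \tag{80}$$

Recall that the averaged channel is

$$\overline{W} = \frac{1}{n}\sum_{k=1}^{n} W_k(y|x). \tag{81}$$

We would like to show that $\hat{W}_A - \overline{W} \to 0$ as $n \to \infty$. Define

$$\gamma_k(x, y) \triangleq \frac{1}{n}\left[\frac{\mathrm{Ind}(X_k = x, Y_k = y)}{Q_{b_k}(x)} - W_k(y|x)\right], \tag{82}$$

then

$$\hat{W}_A - \overline{W} = \sum_{k=1}^{n} \gamma_k(x, y). \tag{83}$$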
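The convexity step used in the proof of Proposition 2 (Lemma 6: for a fixed prior $Q$, the mutual information $I(Q, W)$ is convex in the channel $W$) can be checked numerically. The sketch below is only an illustration under simplifying assumptions: it uses legitimate channel matrices rather than the "false" estimates of the proof, natural logarithms, and a two-block example whose weights play the role of $m_i / n$.

```python
import math


def mutual_information(Q, W):
    """I(Q, W) in nats for a prior Q[x] and a channel matrix W[x][y]."""
    ny = len(W[0])
    out = [sum(Q[x] * W[x][y] for x in range(len(Q))) for y in range(ny)]
    total = 0.0
    for x, qx in enumerate(Q):
        for y in range(ny):
            if qx > 0 and W[x][y] > 0:
                total += qx * W[x][y] * math.log(W[x][y] / out[y])
    return total


def mix_channels(weights, channels):
    """Weighted average of channel matrices (weights sum to 1)."""
    nx, ny = len(channels[0]), len(channels[0][0])
    return [[sum(w * W[x][y] for w, W in zip(weights, channels))
             for y in range(ny)] for x in range(nx)]


def convexity_gap(Q, weights, channels):
    """Weighted sum of per-block informations minus information of the mixture."""
    weighted = sum(w * mutual_information(Q, W) for w, W in zip(weights, channels))
    mixed = mutual_information(Q, mix_channels(weights, channels))
    return weighted, mixed


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]   # y = x
    flip = [[0.0, 1.0], [1.0, 0.0]]       # y = 1 - x
    print(convexity_gap([0.5, 0.5], [0.5, 0.5], [identity, flip]))
```

In this extreme example each block carries $\ln 2$ nats on its own, while the averaged channel is useless, which shows why the bound is stated in terms of the averaged channel rather than the per-block informations.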



Although $\gamma_k(x, y)$ are not i.i.d., they constitute a bounded
martingale difference sequence, where the martingale is
$\sum_{j=1}^{k} \gamma_j$, as we will show below. First, by (72), each component
$\gamma_k(x, y)$ is bounded,
$-\frac{1}{n} \le \gamma_k(x, y) \le \frac{1}{n} \cdot \frac{|\mathcal{X}|}{\lambda} \triangleq \gamma_{\max}$,
so they are bounded in absolute value by $\gamma_{\max}$. On average
over the common randomness, each symbol $X_k$ is generated
$\sim Q_{b_k}(x)$ independently of the past (given $Q_{b_k}(x)$). In other
words, for someone not knowing the specific codebook, the
knowledge of past values of $\mathbf{X}^{k-1}, \mathbf{Y}^{k-1}$ does not yield any
information about $X_k$ when $Q_{b_k}(x)$ is given. Define the state
variable $S_{k-1} = \left(\mathbf{X}^{k-1}, \mathbf{Y}^{k-1}, \{Q_{b_j}\}_{j=1}^{k}\right)$. Note that $Q_{b_k}$ is
only generated as a function of past symbols and therefore can
be considered as part of the state at time $k$. We have:

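$$E\left[\gamma_k(x, y) \,\middle|\, S_{k-1}\right] = \frac{\Pr(X_k = x, Y_k = y \mid S_{k-1})}{n \cdot Q_{b_k}(x)} - \frac{W_k(y|x)}{n} = \frac{Q_{b_k}(x) \cdot W_k(y|x)}{n \cdot Q_{b_k}(x)} - \frac{W_k(y|x)}{n} = 0. \tag{84}$$

Now, since the previous value of the sum $\sum_{j=1}^{k-1} \gamma_j(x, y)$ is a
function of $S_{k-1}$, by applying the iterated expectations law
we have

$$E\left[\gamma_k(x, y) \,\middle|\, \sum_{j=1}^{k-1} \gamma_j\right] = E\left[ E\left[\gamma_k(x, y) \,\middle|\, S_{k-1}, \sum_{j=1}^{k-1} \gamma_j\right] \,\middle|\, \sum_{j=1}^{k-1} \gamma_j\right] = 0, \tag{85}$$

which shows that $\sum_{j=1}^{k} \gamma_j$ is a martingale. We can now apply the
Hoeffding-Azuma inequality [24, A.1.3], [17] and obtain: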



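$$\Pr\left\{\left|\hat{W}_A(y|x) - \overline{W}(y|x)\right| > t\right\} = \Pr\left\{\left|\sum_{k=1}^{n} \gamma_k(x, y)\right| > t\right\} \le 2\exp\left(-\frac{2t^2}{n\,\gamma_{\max}^2}\right). \tag{86}$$

The above holds for each value of $(x, y)$ separately. To bound
the $L_\infty$ norm we use the union bound:

$$\Pr\left\{\|\hat{W}_A - \overline{W}\|_\infty > t\right\} = \Pr\left\{\bigcup_{x,y}\left\{\left|\hat{W}_A(y|x) - \overline{W}(y|x)\right| > t\right\}\right\} \le \sum_{x,y} \Pr\left\{\left|\hat{W}_A(y|x) - \overline{W}(y|x)\right| > t\right\} \le 2|\mathcal{X}| \cdot |\mathcal{Y}| \cdot \exp\left(-\frac{2 n \lambda^2 t^2}{|\mathcal{X}|^2}\right). \tag{87}$$

To guarantee the above holds with probability at most $\delta_0$ we
choose $t$ to make the RHS equal $\delta_0$:

$$t = \delta_W \triangleq \frac{|\mathcal{X}|}{\lambda}\sqrt{\frac{1}{2n}\ln\frac{2|\mathcal{X}| \cdot |\mathcal{Y}|}{\delta_0}}. \tag{88}$$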

This is summarized in the following proposition: 

Proposition 3 (Average estimated channel convergence). For
any $\delta_0 > 0$, and for $\delta_W$ defined above,

$$\Pr\left\{\|\hat{W}_A - \overline{W}\|_\infty > \delta_W\right\} \le \delta_0. \tag{89}$$

Observe that a large $\lambda$ improves the channel estimate
convergence (reduces $\delta_W$), since it increases the minimum rate
at which each input symbol is sampled. This is the additional
role of $\lambda$ that we did not have in Lemma 5.
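The convergence claimed in Proposition 3 can be illustrated by a small simulation of the importance-weighted estimate, $\hat{W}_A(y|x) = \frac{1}{n}\sum_k \mathrm{Ind}(X_k = x, Y_k = y)/Q_{b_k}(x)$. The sketch below rests on several assumptions that are not part of the scheme itself: a binary alphabet, a hand-picked varying sequence of binary symmetric channels, and a deterministic schedule of priors standing in for the predictor (the real priors depend on past feedback). The deviation bound `delta_w` assumes the Hoeffding-Azuma form in which each increment has range $|\mathcal{X}|/(n\lambda)$.

```python
import math
import random


def delta_w(n, lam, nx, ny, delta0):
    """Deviation level: (|X|/lam) * sqrt(ln(2|X||Y|/delta0) / (2n)) (assumed form)."""
    return (nx / lam) * math.sqrt(math.log(2.0 * nx * ny / delta0) / (2.0 * n))


def simulate_average_estimate(n, lam, seed=0):
    """Return (W_hat_A, W_bar, max abs error) for a toy varying binary channel."""
    rng = random.Random(seed)
    nx, ny = 2, 2
    w_hat = [[0.0] * ny for _ in range(nx)]
    w_bar = [[0.0] * ny for _ in range(nx)]
    for k in range(n):
        # Hypothetical channel sequence: crossover probability changes every 100 symbols.
        p = 0.05 if (k // 100) % 2 == 0 else 0.3
        W_k = [[1.0 - p, p], [p, 1.0 - p]]
        # Hypothetical prior schedule, mixed with the uniform prior so Q(x) >= lam/|X|.
        q0 = 0.9 if k % 3 else 0.2
        Q = [(1.0 - lam) * q0 + lam / nx, (1.0 - lam) * (1.0 - q0) + lam / nx]
        x = 0 if rng.random() < Q[0] else 1
        y = 0 if rng.random() < W_k[x][0] else 1
        w_hat[x][y] += 1.0 / (n * Q[x])          # importance-weighted indicator
        for a in range(nx):
            for b in range(ny):
                w_bar[a][b] += W_k[a][b] / n     # true averaged channel
    err = max(abs(w_hat[a][b] - w_bar[a][b]) for a in range(nx) for b in range(ny))
    return w_hat, w_bar, err


if __name__ == "__main__":
    n, lam = 20000, 0.5
    _, _, err = simulate_average_estimate(n, lam)
    print("max error:", err, " delta_W:", delta_w(n, lam, 2, 2, 1e-6))
```

Reducing `lam` in this sketch lets the prior put less mass on some inputs, which inflates the weights $1/Q(x)$ and visibly slows the convergence, in line with the remark on the role of $\lambda$.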



G. Convergence of capacity 

The final step is to link the difference in the channels
$\|\hat{W}_A - \overline{W}\|$ to the difference in capacities. For this purpose we use
the following lemma:

Lemma 8 ($L_p$ bound on difference of false mutual information
and capacity). Let $Q(x)$ be an input distribution on the discrete
alphabet $\mathcal{X}$, $W(y|x)$, $y \in \mathcal{Y}$, a conditional distribution,
and $\hat{W}(y|x)$ a false conditional distribution. Define

$$\Delta_p \triangleq \left\|\hat{W}(y|x) - W(y|x)\right\|_p, \tag{90}$$

where

$$\|f(x, y)\|_p = \begin{cases} \left(\sum_{x,y} |f(x, y)|^p\right)^{1/p} & p < \infty \\ \max_{x,y} |f(x, y)| & p = \infty \end{cases} \tag{91}$$



Assuming Ap < ^ we have 

$$\forall Q: \quad \left|I(Q, \hat{W}) - I(Q, W)\right| \le 2 f_p(\Delta_p), \tag{92}$$

and

$$\left|C(\hat{W}) - C(W)\right| \le 2 f_p(\Delta_p), \tag{93}$$

where

(94)



For p ^ oo, by convention l/p = 0. Furthermore fp{t) is 
concave and monotonically non-decreasing for i < |- 

Note that the lemma is also true with respect to legitimate
distributions. The proof of the lemma is based on Cover and
Thomas' $L_1$ bound on entropy [23], and Hölder's inequality,
and appears in Appendix D.
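To get a feeling for the continuity statement of Lemma 8, one can compute the capacities of two nearby channels numerically. The sketch below uses the Blahut-Arimoto iteration, which is a standard way to evaluate $C(W) = \max_Q I(Q, W)$ and is not the method of the paper; it is restricted to legitimate channel matrices, and all function names are illustrative.

```python
import math


def mutual_information(Q, W):
    """I(Q, W) in nats for a prior Q[x] and a channel matrix W[x][y]."""
    ny = len(W[0])
    out = [sum(Q[x] * W[x][y] for x in range(len(Q))) for y in range(ny)]
    total = 0.0
    for x, qx in enumerate(Q):
        for y in range(ny):
            if qx > 0 and W[x][y] > 0:
                total += qx * W[x][y] * math.log(W[x][y] / out[y])
    return total


def capacity(W, iters=500):
    """Blahut-Arimoto iteration; returns (capacity estimate in nats, prior)."""
    nx, ny = len(W), len(W[0])
    Q = [1.0 / nx] * nx
    for _ in range(iters):
        out = [sum(Q[x] * W[x][y] for x in range(nx)) for y in range(ny)]
        # Divergence D(W(.|x) || output distribution) for every input x.
        d = [sum(W[x][y] * math.log(W[x][y] / out[y])
                 for y in range(ny) if W[x][y] > 0) for x in range(nx)]
        weights = [Q[x] * math.exp(d[x]) for x in range(nx)]
        s = sum(weights)
        Q = [w / s for w in weights]
    return mutual_information(Q, W), Q


def bsc(p):
    """Binary symmetric channel with crossover probability p."""
    return [[1.0 - p, p], [p, 1.0 - p]]


if __name__ == "__main__":
    c_true, _ = capacity(bsc(0.10))
    c_est, _ = capacity(bsc(0.11))   # L-infinity distance between the channels is 0.01
    print("capacity difference (nats):", abs(c_true - c_est))
```

A perturbation of 0.01 in every channel entry moves the capacity by roughly 0.02 nats here, a small change of the kind the lemma bounds in terms of the channel distance.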



H. Main argument of the proof 

We now combine the results above as follows: Choose a
value of $\delta_0$. We denote by $E$ the event of any decoding error
occurring in any of the blocks, and by $D$ the event
$\|\hat{W}_A - \overline{W}\|_\infty > \delta_W$. We use an overline to denote
complementary events.

Consider the event $\overline{D} \cap \overline{E}$. In this case, we have
$\|\hat{W}_A - \overline{W}\|_\infty \le \delta_W$ and from Lemma 8 this implies
$|C(\hat{W}_A) - C(\overline{W})| \le \delta_C$ where
$\delta_C \triangleq 2 f_\infty(\delta_W) = -2\delta_W \cdot |\mathcal{Y}| \log(\delta_W)$.
From Proposition 2 we have that:

$$\begin{aligned}
R &\ge (1 - \delta_1) \cdot \min\left(C(\hat{W}_A), I_{\max}\right) - \Delta_{\mathrm{pred}} \\
&\ge (1 - \delta_1) \cdot \min\left(C(\overline{W}) - \delta_C, I_{\max}\right) - \Delta_{\mathrm{pred}} \\
&\ge (1 - \delta_1) \cdot \left(\min\left(C(\overline{W}), I_{\max}\right) - \delta_C\right) - \Delta_{\mathrm{pred}} \\
&= (1 - \delta_1) \cdot \left(C(\overline{W}) - \delta_C\right) - \Delta_{\mathrm{pred}} \\
&= C(\overline{W}) - \delta_1 \cdot C(\overline{W}) - \delta_C \cdot (1 - \delta_1) - \Delta_{\mathrm{pred}} \\
&\ge C(\overline{W}) - \left(\delta_1 \cdot I_{\max} + \delta_C + \Delta_{\mathrm{pred}}\right).
\end{aligned} \tag{95}$$

To summarize, if $\overline{D} \cap \overline{E}$ then $R \ge C(\overline{W}) - \Delta_C$. By the union
bound and Propositions 1, 3 we have:

$$\Pr\left\{R < C(\overline{W}) - \Delta_C\right\} \le \Pr\{D \cup E\} \le \Pr\{D\} + \Pr\{E\} \le \delta_0 + \epsilon. \tag{96}$$

Note that although Lemma 8 is stated for general $L_p$ norms,
we have used it here only with respect to the $L_\infty$ norm, since
it is relatively simple to obtain bounds on the convergence
of $\hat{W}_A - \overline{W}$ by using the well known Hoeffding-Azuma
inequality per channel element $(x, y)$ and the union bound.
However, as the distribution of $\hat{W}_A$ tends to a multivariate
Gaussian distribution, using the $L_2$ norm seems to be more suitable.
Indeed, applying Lemma 8 with the $L_2$ norm, together with the
(yet unpublished) bound on the $L_2$ convergence of vector
martingales due to Hayes [25], yields tighter bounds on the
probability of having a small difference $C(\hat{W}_A) - C(\overline{W})$ for
large alphabet sizes.

I. Choice of the parameters

We now substitute the numerical expressions for the various
overheads, and set the parameters of the scheme to optimize
the convergence rate. $\delta_0, \epsilon$ are parameters of choice, and
together with $\lambda, K$ they determine $\Delta_C$. Our purpose is to
choose $\lambda, K$ that will approximately minimize $\Delta_C$. This part
is rather tedious. We write $\Delta_C$ and collect all the relations
below:



and making this assumption, we have that the last element in 



5i is bounded by log (^^j < \ logn. Further assuming that 

ki < jfcologn (this holds trivially for the values of /coi^i 
of Lemma [tI when n > 2*), and e > ^ (for some arbitrary 
polynomial aecay rate dc) we have 



(51 < [de log(ri) + (fco + 1) \ogn + |fco log?i + \ logn] 



logri 
K 



(4 + |fco + i). 



(103) 



Using these bounds and extracting the constants we can upper 
bound Ac by: 



Ac < C2 



Inn C3 ln(7T,) 




hi(n) 

n 



K 



K 

1 

(5) 

(104) 

where element (1) stems from 5i, (2) from 5c and (3) — (5) 
from Apred, and the constants are: 



(4) 



C2 



(4 



/„ 



lege 



(105) 



C3 = \X\-\y\-\og{e) 



C4 



Cl 



K 



I) ■ In 



(107) 



As we shall see, element (5) is negligible. Therefore we first 
optimize the sum of (1) and (4) with respect to K, using 
Lemma|3] We write the sum as aK" + bK^^ with a = 5, /3 = 

l,a = C4-y/i^^^ ■ X' ^ ^ C2lnn. Since K is required to be 
integer, we write it as a function of a real valued parameter 
t: K = lt\, and assume t > 5. Then ^ < = < 

|i, and therefore aK°' + bR-^ < at"' + b{^)^ -t-^. By 

b' 

optimizing the bound with respect to t using Lemma p] we 
obtain 



Ac = 

61 = 

Sc = 

5w = 

Aprcd ^ 

Cl = 

Since 6w > 



61 ■ /max + Sc + \rad 

1 



K 



log(e i) + (fco + l)logn + fci 
X\ 



log 



A 



-2S w \ynog{6w) 
A V 2n I So 



K 

n 



/ln(n)^_i 
\ + ci\l—^\ 2. 



2^K ■\X\{\X\-l)-I, 



n 

max 



(97) 

(98) 
(99) 

(100) 

(101) 
(102) 



^, —2\og{6w) < logn, therefore 6c < 



\y\ log(n). To make S 



w 



we need 



\x\ 



b'l3 
aa 



where we defined 



'§0204^)^ •(A-nlnn)3 , (108) 



C5^(ic2C4-^)^ 



„ (32) 1 2 1 

aK" + bK-^ ^ 23| . a3 . (6')3 



i (\n^{n) 1\3 



n A 



(109) 



(110) 



Substituting in (104) (and upper bounding element (5) by
(111)



To determine $\lambda$ we notice that it is a trade-off between element
(3), which is increasing in $\lambda$, and either (1) + (4) or (2), which
are decreasing. Minimizing any combination separately (i.e.
((1) + (4)) + (3) or (2) + (3)) using Lemma 5 yields the



same decay rate O 



ln"(n) 



and A of the form 



A = CA 



In^(n) 



(112) 



Therefore this determines the best decay rate possible for
(111). Note that we do not have to worry about the case $\lambda > 1$,
since in this case the term $\lambda I_{\max}$ in (104) will exceed $I_{\max}$
and Theorem 3 will be true in a void way. Substituting $\lambda$ we
have:



Ar- < 



C6 
1 

'-A 



1 



C3 / In {n) 
c\\ n 



1 1 

2 4 



(l) + (4) 



C5 



(2) 

Inn 

A TT' 



< 



< 



C6_ 
1 
^3 



CA 



C3 
5C2C| 



(3) 



C3 
Ca 



In^(n) 



(5) 
1 
4 



C5 • 



Inn 
n^ 



In^(n) 



(113) 



where in the last inequality we substituted the expression for 

Cg and assumed C5 • (^^) ^ < ^ In the last step we 

defined 



CA — 2 



5C2C4 
2 



1 

2 \ 3 



CA + 1. 



(114) 



CA / CA 

We now revisit the assumptions we have made along the 
way. 



In ( fTT3] l, we assumed C5 • (^)' < 
requires that (In n) 6 n 12 > C5 



In^ (k) 



This 



and a 



sufficient condition is n > ^|^^ . 

For ( |103| l we assumed ^ < y^. Substituting A leads to 

nln (n) > ^J-, and a sufficient condition is 

n>^. (115) 
CI 

For ( |103| l we assumed e > We may simply deter- 
mine df and set e = 

For ( |103[ ) we assumed fci < ifcologn, i.e. n > 
exp(4fci/fco) 

The application of Lemma 4 to obtain Proposition 2
requires that $n \ge e$ and $\tilde{K} \ge 2 \cdot I_{\max}$. Since $\tilde{K} > K$
it is sufficient that $K \ge 2 I_{\max}$, or $t \ge 2 I_{\max} + 1$.
Furthermore for (110) we assumed $t \ge 5$, so we
require t* > max(2/max + 1,5). Substituting t* — 
C5 • (A • n Inn)^ = C5 • (n In^ n)^/^ > C5 • c^n-'^/'* leads 
to the sufficient condition: 

4 



> (max(2/max + 1, 5) • C5 ^ • C;^ 



(116) 



To summarize, the result holds for $n \ge n_{\min}$ where $n_{\min}$
is the maximum of the conditions of (115), (116), (103)
and of $n \ge e$:



max 



lA-l 



-, ^max(2/n 



1,5) 



exp(4fci/fco) 



(117) 

This proves Corollary 1. □

The claims of the Theorem are milder and are easily
deduced from this Corollary. Given $\epsilon, \delta$, let $\delta_0 = \frac{1}{2}\delta$, and
choose any $d_\epsilon > 0$ and $c_\lambda > 0$. Choose $N$ large enough
so that the error probability given by the Corollary satisfies
$\epsilon(N) = N^{-d_\epsilon} \le \min(\epsilon, \frac{1}{2}\delta)$, and $N \ge n_{\min}$. This guarantees
that for $n \ge N$, the requirements of the Corollary are met, the
error probability is $\epsilon(n) \le \epsilon$, and the probability to fall short
of the rate is at most $\epsilon(n) + \delta_0 \le \delta$. This concludes the proof
of Theorem 3. □

Following is a numerical example for the calculation of $c_\Delta$
and $n_{\min}$ in Theorem 3.

Example 2. For $|\mathcal{X}| = 4$, $|\mathcal{Y}| = 6$, $d_\epsilon = 1$ and $\delta_0 = 10^{-10}$
we obtain $I_{\max} = 2$ and $c_2 = 72.1$, $c_3 = 127$, $c_4 = 9.8$,
$c_5 = 6.97$. Choosing $c_\lambda = 10$ we obtain $c_\Delta = 51.7$ and
$n_{\min} = \max(e, 0.0256, 0.0123, 16) = 16$. The convergence rate is
rather slow and we have Ac < 0.2 only for n > 3.98 • 10^^. 

J. Proof of Corollary 2

During the proof of Theorem 3 we assumed the channel
sequence is unknown but fixed. It is easy to see that the same
proof holds even if the channel sequence is determined by an
online adversary.

The error probability (Proposition 1) is maintained regardless
of channel behavior, because the probabilistic assumptions
made (61) refer to the distribution of codewords that were
not transmitted. Proposition 2 does not make any assumptions
on the channel, as it connects the communication rate with
the measured channel. The main difference is with respect to
channel convergence. For the proof of Proposition 3 to hold we
need to show that $\gamma_k$ remains a bounded martingale difference
sequence, which boils down to verifying that (84) still holds, i.e.
that $\gamma_k$ has zero mean conditioned on the past. Adding the
message to the state variable $S_{k-1}$ defined before (84), i.e.
redefining $S_{k-1} = \left(\mathbf{X}^{k-1}, \mathbf{Y}^{k-1}, \{Q_{b_j}\}_{j=1}^{k}, b_1^{\infty}\right)$ where $b_1^{\infty}$
is the message bit sequence, we have that (84) holds even
when the channel $W_k(y|x)$ is a function of $S_{k-1}$. □



K. A result for channels with memory of the input 

Although channels with memory of the input are not con- 
sidered in this paper, the scheme presented above can be used 
over such channels as well. In this case, the performance of 
the scheme can be characterized as follows: 

Lemma 9. When the scheme of Theorem 3 is operated over a
general channel $\Pr(\mathbf{Y}^n | \mathbf{X}^n)$, the results of the theorem hold
if the averaged channel is redefined as follows:

$$\overline{W}(y|x) \triangleq \frac{1}{n}\sum_{k=1}^{n} \Pr\left(Y_k = y \,\middle|\, X_k = x, \mathbf{X}^{k-1}, \mathbf{Y}^{k-1}\right). \tag{118}$$

Note that for each pair $x, y$,
$\Pr(Y_k = y \mid X_k = x, \mathbf{X}^{k-1}, \mathbf{Y}^{k-1})$ is a random variable depending
on the history $\mathbf{X}^{k-1}, \mathbf{Y}^{k-1}$, and therefore, different from the main
setting considered in this paper, $\overline{W}$ is also a random variable. The
definition above (118) coincides with the previous definition
of (7) when the channel is memoryless in the input. This
lemma is used in [26] to show competitive universality for
channels with memory of the input.

Proof: As in the proof of Corollary 2, it is easy to see
that assumptions on the channel apply only to Proposition 3,
showing the convergence of the average estimated channel $\hat{W}_A$
to $\overline{W}$. To show that Proposition 3 holds, we need to show that $\gamma_k$
remains a bounded martingale difference sequence, where now
$\gamma_k$ is defined as:

$$\gamma_k(x, y) \triangleq \frac{1}{n}\left[\frac{\mathrm{Ind}(X_k = x, Y_k = y)}{Q_{b_k}(x)} - \Pr\left(Y_k = y \,\middle|\, X_k = x, \mathbf{X}^{k-1}, \mathbf{Y}^{k-1}\right)\right]. \tag{119}$$

As in (83), we have $\hat{W}_A - \overline{W} = \sum_{k=1}^{n} \gamma_k(x, y)$. Equation
(84) now becomes

$$E\left[\gamma_k \,\middle|\, S_{k-1}\right] = \frac{\Pr(X_k = x, Y_k = y \mid S_{k-1})}{n \cdot Q_{b_k}(x)} - \frac{\Pr(Y_k = y \mid X_k = x, \mathbf{X}^{k-1}, \mathbf{Y}^{k-1})}{n} = \frac{Q_{b_k}(x) \cdot \Pr(Y_k = y \mid X_k = x, \mathbf{X}^{k-1}, \mathbf{Y}^{k-1})}{n \cdot Q_{b_k}(x)} - \frac{\Pr(Y_k = y \mid X_k = x, \mathbf{X}^{k-1}, \mathbf{Y}^{k-1})}{n} = 0. \tag{120}$$

The rest of the proof of Proposition 3 remains the same. □



VI. Discussion and comments 

In this section we discuss the relation of the current results 
to existing results pertaining to unknown channels and make 
some comments on schemes presented here. 



A. A comparison with AVC capacity 

It is interesting to compare the target rate $C(\overline{W})$ with the
AVC capacity. We will give a short background on the AVC
and the relation to the current problem.

In the traditional AVC setting [1], the channel model is similar
to the setting assumed here, but slightly more constrained.
The channel in each time instance is assumed to be chosen
arbitrarily out of a set of channels, each of which is determined
by a state. Frequently, constraints on the state sequence (such
as maximum power, number of errors) are defined. The AVC
capacity is the maximum rate that can be transmitted reliably,
for every sequence of states that obeys the constraints.

The AVC capacity may be different depending on whether
the maximum or the average error probability over messages is
required to tend to zero with block length, on the existence of
feedback, and on whether common randomness is allowed, i.e.
whether the transmitter and the receiver have access to a shared
random variable. The last factor has a crucial effect on the
achievable rate as well as on the complexity of the underlying
mathematical problem: the characterization of AVC capacity
with randomized codes is relatively simple and independent of
whether maximum or average error probability is considered,
while the characterization of AVC capacity for deterministic
codes is, in general, still an open problem. Randomization
has a crucial role, since we consider the worst-case sequence
of channels. This sequence of channels is chosen after the
deterministic code was selected (and is therefore sometimes
viewed as an adversary), enabling the worst-case sequence
of channels to exploit vulnerabilities that exist in the specific
code. As an example, for every symmetrizable AVC [27,
Definition 2], the AVC capacity for deterministic codes is
zero [27, Theorem 1]. When randomization does exist, the
random seed is selected "after" the channel sequence was
selected (mathematically, the probability over random seeds
is taken after the maximum error probability over all possible
sequences), and this prevents tuning the channel to the
worst-case code. When randomization exists, the channel
inputs may be made to appear independent from the point
of view of the adversary, thus limiting effective adversary
strategies. Therefore the results in the current paper assume
that common randomness exists.

We would now like to compare the target rate $C(\overline{W})$ with
the randomized AVC capacity. The discrete memoryless AVC
capacity without constraints may be characterized as follows:
let $\mathcal{W}$ be the set of possible channels that are realized by
different channel states (for example, in a binary modulo-additive
channel with an unknown noise sequence, there are
two channels in the set: one in which $y = x$ and another
in which $y = 1 - x$). This set is traditionally assumed to be
finite, i.e. there is a finite number of "states"; however, this
constraint is immaterial for the comparison. The randomized
code capacity of the AVC is [1, Theorem 2]:

$$C_{\mathrm{AVC}} = \max_{Q} \min_{W \in \mathrm{conv}(\mathcal{W})} I(Q, W) = \min_{W \in \mathrm{conv}(\mathcal{W})} \max_{Q} I(Q, W) = \min_{W \in \mathrm{conv}(\mathcal{W})} C(W), \tag{121}$$

where $\mathrm{conv}(\mathcal{W})$ is the convex hull of $\mathcal{W}$, which represents all
channels which are realizable by a random drawing of channels
from $\mathcal{W}$.† In the example, $\mathrm{conv}(\mathcal{W})$ would be the set of
all binary symmetric channels. When input or state constraints
exist, they affect (121) simply by including in the set of $Q$-s
and in $\mathrm{conv}(\mathcal{W})$ only those priors, or channels, that satisfy the
constraints (respectively). The converse of (121) is obtained by
choosing the worst-case channel
$W^* = \arg\min_{W \in \mathrm{conv}(\mathcal{W})} C(W)$ and
implementing a discrete memoryless channel (DMC) where
the channel law is $W^*$, by a random selection of channels from
$\mathcal{W}$. Hence it is clear that the randomized code capacity cannot
be improved by feedback. In contrast, the deterministic code
AVC capacity can be improved by feedback, and in some cases
made equal to the randomized code capacity [28], [29], [30].
Therefore, most existing works on feedback in AVC deal with
the deterministic case.

Since by definition $\overline{W} \in \mathrm{conv}(\mathcal{W})$, we have from (121),
$C(\overline{W}) \ge C_{\mathrm{AVC}}$, i.e. our target rate meets or exceeds the AVC
capacity. While in the traditional setting, a-priori knowledge
of $\mathcal{W}$ or state constraints on the channel is necessary in order
to obtain a positive rate, here we attain a rate possibly higher
than the AVC capacity, without prior knowledge of $\mathcal{W}$. This is
important since without such constraints, i.e. when the channel
sequence is completely arbitrary, the AVC capacity is zero.
This property makes the system presented here universal with
respect to the AVC parameters, a universality which also holds
in an online-adversary setting (Corollary 2).

We can view the difference between $C_{\mathrm{AVC}}$ (121) and $C(\overline{W})$
as the difference between the capacities of the worst realizable
channel $W^* \in \mathrm{conv}(\mathcal{W})$, and the specific channel
$\overline{W} \in \mathrm{conv}(\mathcal{W})$ representing the average of the sequence of
channels that actually occurred. This difference is obtained
by adapting the communication rate to the capacity of the
average channel, and adapting the input prior to the prior that
achieves this capacity, whereas in the AVC setting, the rate
and the prior are determined a-priori, based on the worst-case
realizable channel.
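The gap between $C_{\mathrm{AVC}}$ and $C(\overline{W})$ is easy to see in the binary modulo-additive example mentioned above. In the sketch below (an illustration only, with rates in bits and hypothetical function names), the averaged channel of a noise sequence with a fraction $\hat{p}$ of ones is a binary symmetric channel with crossover $\hat{p}$, so $C(\overline{W}) = 1 - h(\hat{p})$, while the unconstrained randomized AVC capacity is the minimum of $1 - h(p)$ over all binary symmetric channels, approximated here by a grid search.

```python
import math


def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def averaged_channel_capacity(noise):
    """C of the averaged channel for y = x XOR z_k: a BSC with the empirical flip rate."""
    p_hat = sum(noise) / len(noise)
    return 1.0 - binary_entropy(p_hat)


def avc_randomized_capacity(grid=100):
    """min over conv(W) (all BSCs) of C(W), evaluated on a uniform grid of crossovers."""
    return min(1.0 - binary_entropy(j / grid) for j in range(grid + 1))


if __name__ == "__main__":
    noise = [1 if k % 10 == 0 else 0 for k in range(1000)]   # 10% of symbols flipped
    print("C(W_bar) =", averaged_channel_capacity(noise))
    print("C_AVC    =", avc_randomized_capacity())
```

For this noise sequence the target rate is about 0.53 bits per symbol, whereas the worst-case guarantee is zero because the adversary may choose a sequence whose flips look like fair coin tosses.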

As we noted above, feedback cannot improve the random- 
ized AVC capacity. Therefore the improvement is attained 
not merely by the use of feedback, but by allowing the 
communication rate to vary, whereas in the traditional AVC 
setting, one looks for a fixed rate of communication which 
can be guaranteed a-priori (note that the improvement is not 
in the worst case). In allowing the rate to vary, we have lost 
the formal notion of capacity (as the supremum of achievable 
rates), thereby making the question of setting the target rate 
more ambiguous, but nevertheless improved the achieved rates. 



B. Relation to empirical capacity and mutual information 

The capacity of the averaged channel $C(\overline{W})$ is a slight
generalization of the notion of empirical capacity defined by
Eswaran et al. [3, §D]. The only difference is releasing the
assumption made there, that the set of channel states is finite.
The empirical capacity of Eswaran et al. is in itself a generalization
of the empirical capacity for modulo additive channels defined
by Shayevitz and Feder [2]. Eswaran et al. [3] assume the prior
$Q$ is given a-priori and attain the empirical mutual information
$I(Q, \overline{W})$. The scheme used here is similar to the scheme they
presented in its high level structure. We can view the current
result (Theorem 3) as an improvement over the previous work,
i.e. attaining the capacity $C(\overline{W}) \ge I(Q, \overline{W})$, rather than the
mutual information, by the addition of the universal predictor.
Our result answers the question raised there [3, §D], whether
the empirical capacity is attainable.

Another small extension is in Corollary 2, showing that the
result holds in an adversarial setting. This extension is outside
our main focus of communicating over unknown channels,
and is only used to strengthen the claim on universality with
respect to the AVC parameters.

The main result (Theorem 3) could be derived in a conceptually
simpler but crude scheme, by combining the results
of Eswaran et al. [3] or our previous paper [4] with Theorem 1.
The transmission time $n$ may be divided into multiple fixed-size
blocks $i = 1, \ldots, N$, and in each block, one of these
schemes is operated, with an i.i.d. prior chosen by a predictor.
Using Eswaran's result, for example, and ignoring some
details such as finite-state assumptions, one would obtain the
rate $I(Q_i, \overline{W}_i)$ over each block, where $\overline{W}_i$ is the averaged
channel over the block. The channel $\overline{W}_i$ can be well estimated
(e.g. using training symbols or using the communication
scheme itself). Assuming it is known, if the prediction scheme
of Theorem 1 is operated over $\overline{W}_i$ it will guarantee that the
average rate over the $N$ blocks will be asymptotically at
least $\frac{1}{N}\sum_{i=1}^{N} I(Q, \overline{W}_i)$ for any $Q$, and using convexity,
$\frac{1}{N}\sum_{i=1}^{N} I(Q, \overline{W}_i) \ge I\left(Q, \frac{1}{N}\sum_{i=1}^{N} \overline{W}_i\right) = I(Q, \overline{W})$. Since
this holds for any $Q$, this achieves the capacity of the average
channel. Note that here it appears that there is no need for
the uniform prior; however, this is somewhat hidden in the
assumption that the channel is known. Furthermore, there is
no need to worry about rateless blocks extending "forever"
since the communication scheme is re-started on each of the $N$
blocks.

† The convex hull replaces the distribution over channel states in [1].

C. Competitive universality

In a related paper [5] we presented the concept of the
iterated finite block capacity $C_{\mathrm{IFB}}$ of an infinite vector channel,
which is similar in spirit to the finite state compressibility
defined by Lempel and Ziv [31]. Roughly speaking, this value
is the maximum rate that can be reliably attained by any
block encoder and decoder, constrained to apply the same
encoding and decoding rules over sub-blocks of finite length.
The positive result is that $C_{\mathrm{IFB}}$ is universally attainable for all
modulo-additive channels (i.e. over all noise sequences). The
result is obtained by a system similar to the one described
in Section IV-B while the input prior is fixed to the uniform
prior. The result uses two key properties of the modulo additive
channel:

1) The channel is memoryless with respect to the input Xi 
(i.e. current behavior is not affected by previous values 
of the input). 

2) The capacity achieving prior is fixed for any noise 
sequence. 

The current work is a step toward removing the second
assumption. The capacity of the averaged channel is a bound
on the rate that can be obtained reliably by a transmitter and
a receiver operating on a single symbol, since the channel
that this system "sees" can be modeled as a random uniform
selection of a channel out of $\{W_i\}_{i=1}^{n}$, which we term the
"collapsed channel" [5]. By combining $k$ symbols into a single
super-symbol, we can extend the result and obtain a rate which
is equal to or better than the rate obtained by a block encoder
and decoder operating over chunks of $k$ symbols. Therefore
the current result suggests that it is possible to attain $C_{\mathrm{IFB}}$
for all vector channels that are memoryless in the input, i.e.
that have the form defined in (3), for an arbitrary sequence of
channels $W_i$ (compared to only an arbitrary noise sequence,
in the previous result).

D. Notes on the converse 

It is interesting to consider the converse (Theorem 2)
from the following point of view: Suppose a competitor is
given the entire sequence of channels $W_1^n$, but is allowed
to take from this sequence only the "histogram" (a list of
channels and how many times they occurred), and devise a
communication system based on this information. The rate
that can be guaranteed in this case is limited by $C(\overline{W})$. On
the other hand, assuming common randomness exists, it is
enough to know $\overline{W}$ in order to attain $C(\overline{W})$ without feedback.
To see this intuitively, we may apply a random interleaver and
use the fact that the interleaved channel is similar to a DMC with
the channel law $\overline{W}$. Therefore even if one knows the entire
histogram of the sequence, the average channel $\overline{W}$, which
contains less information, contains all the information necessary
for communication.

To illustrate this, consider the deterministic setting, where
instead of a sequence of channel laws $W_i(y|x)$ we have a
sequence of deterministic functions $f_i : \mathcal{X} \to \mathcal{Y}$. This is a
particular case of our problem, with
$W_i(y|x) = \mathrm{Ind}(y = f_i(x))$. Even in this case, according to Theorem 2, a
competitor knowing the list of functions up to order will not
be able to guarantee a rate better than $C(\overline{W})$, where
$\overline{W} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{Ind}(y = f_i(x))$, i.e. a channel created by counting,
for each $x$, the normalized number of times a certain $y$ would
appear as output.

Comparing the amount of information in the channel histogram
and the averaged channel in this case, there are $|\mathcal{Y}|^{|\mathcal{X}|}$
functions, and therefore the distribution is given by $|\mathcal{Y}|^{|\mathcal{X}|} - 1$
real numbers. On the other hand, the average channel is a
probability distribution from $\mathcal{X}$ to $\mathcal{Y}$ and is specified by
$(|\mathcal{Y}| - 1) \cdot |\mathcal{X}|$ real numbers.
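The deterministic example can be made concrete with a few lines of code. The sketch below is illustrative only: each function $f_i$ is represented as a tuple whose $x$-th entry is $f_i(x)$, the averaged channel is obtained by counting outputs per input, and the two helper functions return the number of free real parameters of the function histogram and of the averaged channel.

```python
from fractions import Fraction


def averaged_channel(functions, nx, ny):
    """W_bar(y|x) = (1/n) * number of i with f_i(x) = y, as exact fractions."""
    n = len(functions)
    W = [[Fraction(0)] * ny for _ in range(nx)]
    for f in functions:
        for x in range(nx):
            W[x][f[x]] += Fraction(1, n)
    return W


def histogram_parameters(nx, ny):
    """Free parameters of a distribution over all |Y|^|X| functions."""
    return ny ** nx - 1


def averaged_channel_parameters(nx, ny):
    """Free parameters of a conditional distribution from X to Y."""
    return (ny - 1) * nx


if __name__ == "__main__":
    identity, flip, const0 = (0, 1), (1, 0), (0, 0)
    W = averaged_channel([identity, identity, flip, const0], nx=2, ny=2)
    print(W)
    print(histogram_parameters(4, 6), averaged_channel_parameters(4, 6))
```

For $|\mathcal{X}| = 4$ and $|\mathcal{Y}| = 6$ the histogram has 1295 free parameters while the averaged channel has only 20, which is the sense in which the averaged channel "contains less information".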

An interesting property revealed through the example, is 
that although the setting is deterministic, the result is given in 



terms of probability functions. These "probabilities" are only 
averages related to the deterministic function sequence, but this 
shows that the formulation via probabilities (or frequencies) is 
more natural than by specifying the function fi between the 
input and output. 

E. The required feedback rate 

We assumed the feedback channel has unlimited rate, and 
is free of delays and errors. This was done mainly to focus the 
discussion and simplify the results. It is clear from the scheme 
presented, that because the amount of information required to 
be fed back to the transmitter can be made small, the capacity 
of the average channel could be attained even if the feedback 
link has any small positive rate and a fixed delay. If the 
feedback channel is such that errors can be mitigated by coding 
with finite delay, then errors can be accommodated as well. 
Specifically, we show in Appendix J that when the feedback
rate is limited, or there is a fixed delay, the penalty is a gap
of at most $O(\log n)$ symbols between the blocks, and that the
normalized loss from this effect tends to zero. Therefore we
have $\Delta_C \to 0$ as $n \to \infty$ (with the notation of Theorem 3), with any
positive feedback rate and any fixed delay. The gap may be
reduced by using the time of the $i$-th block to transfer the
channel information from block $i - 1$ and use it only in block
$i + 1$ (i.e. insert a delay of one block in the prediction scheme),
however this approach is not analyzed here.

F. Convergence rate 

Throughout the course of this paper, as we have gradu- 
ally made our assumptions more realistic, we have seen a 
deterioration of the rate of convergence, of the achieved rate 
to the target rate. We denote by $\delta_n$ the gap between the
guaranteed rate and the target rate, and focus on the dominant
polynomial power $p = -\lim_{n \to \infty} \frac{\ln \delta_n}{\ln n}$, while ignoring the
In n terms. We have p = ^ in the synthetic problem of 
(assuming "block-wise" variation)
when using the rateless scheme under assumptions of perfect
average channel knowledge diVl and p ~ \ when releasing 
the abstract assumptions f|V] The first deterioration (between 
i and |) is mainly attributed to the rateless coding scheme. 
More specifically, it stems from mixing with the uniform 
prior, which is necessary to bound the regret per block when 
the blocks have variable lengths. The second deterioration 
(between | and i) can be attributed mainly to the fact that 
the number of bits per block K has to increase in a certain 
rate in order to balance overheads created by the universal 
decoding procedure (and reduces the rate of adaptation). While 
the rate of convergence which was achieved deteriorates, the 
only upper bound we presented on the convergence rate is
P < 



SIII-Ci, which is tight only for the first case. We do 
not know whether better convergence rates can be attained in 
Theorems |5|3| 



TABLE I
Summary of the results

Settings:
(a) Synthetic problem ("block-wise variation")
(b) Arbitrarily varying channel, with side information on average channel and without communication overheads
(c) Arbitrarily varying channel

                                 (a)          (b)               (c)
Reference                        Theorem 1    §IV-C, Lemma 5    §II-B, §V, Theorem 3
C1 attainability                 No           No                No
C2 attainability                 Yes          No                No
C3 attainability                 Yes          Yes               Yes
Normalized regret lower bound    [illegible]  [illegible]       [illegible]
Normalized regret attained       [illegible]  [illegible]       [illegible]

Notes:
C1 = capacity of $\{W_i\}_{i=1}^{n}$ = mean capacity.
C2 = mean mutual information with fixed prior, $\max_Q \frac{1}{n} \sum_i I(Q, W_i)$.
C3 = capacity of the time-averaged channel, $C(\overline{W}) = C(\frac{1}{n} \sum_i W_i)$. 1) Best attainable rate not using time structure (Theorem 2). 2) $C_3 \ge C_{AVC}$ (Section VI-A).

G. Comments on the prediction scheme

The results in this paper were obtained by exponential weighting. This scheme was selected mainly due to its simplicity and elegance. Unfortunately, the exponential weighting is performed over a continuous domain (of probabilities), and therefore it is not immediately implementable. Of course, the
simplest practical solution could be discrete sampling of the 
unit simplex and replacement of the integrals by sums. Since 
the mutual information is continuous, it is possible to bound 
the error resulting from this discretization. An alternative 
way is to quantize the set of priors. Instead of competing 
against a continuum of reference schemes, we first reduce 
the number of reference schemes to a finite one, by creating
a "codebook" of priors $\{Q_m\}$. This codebook is designed
so that the penalty in the mutual information resulting from 
rounding to the nearest codeword, is small. This quantization 
is useful in terms of the feedback link, which now only has 
to convey the index m. Having quantized the priors, we may 
replace the predictors shown here by standard schemes used 
for competition against a finite set of references lfT3l S21. lfT5l . 
See a rough analysis of this approach in Appendix |l] An 
alternative approach is to bypass the explicit calculation of the predictor $Q_i$ and use a rejection-sampling based algorithm to generate a random variable $X \sim Q_i$. This approach is demonstrated in Appendix K.
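The quantized variant can be sketched as follows. This is a minimal illustration, assuming a finite codebook of priors and a gain equal to the cumulative mutual information over past channels (as in the block-wise setting); the function names `mutual_information` and `exp_weighted_prior`, the grid, and the value of `eta` are hypothetical choices for the example and are not taken from the paper.

```python
import math


def mutual_information(q, w):
    """I(Q, W) in nats for a prior q[x] and a channel matrix w[x][y]."""
    ny = len(w[0])
    # Output distribution induced by the prior.
    p = [sum(q[x] * w[x][y] for x in range(len(q))) for y in range(ny)]
    total = 0.0
    for x, qx in enumerate(q):
        for y in range(ny):
            if qx > 0 and w[x][y] > 0:
                total += qx * w[x][y] * math.log(w[x][y] / p[y])
    return total


def exp_weighted_prior(codebook, past_channels, eta):
    """Mixture of codebook priors with weights exp(eta * cumulative gain)."""
    gains = [sum(mutual_information(q, w) for w in past_channels) for q in codebook]
    top = max(gains)  # subtract the maximum for numerical stability
    weights = [math.exp(eta * (g - top)) for g in gains]
    z = sum(weights)
    nx = len(codebook[0])
    return [sum(wt * q[x] for wt, q in zip(weights, codebook)) / z for x in range(nx)]


if __name__ == "__main__":
    grid = [[k / 10, 1 - k / 10] for k in range(11)]
    z_channel = [[1.0, 0.0], [0.5, 0.5]]
    print(exp_weighted_prior(grid, [z_channel] * 20, eta=50.0))
```

With no observations the predictor is the plain average of the codebook; as gains accumulate, the weight concentrates on the codewords with the highest cumulative mutual information.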

Zinkevich [32] proposed a computationally efficient online algorithm, based on gradient descent, to solve a problem of minimizing the sum of convex functions, each revealed to the forecaster after the decision was made (a similar setting to that of Lemma 4). To apply Zinkevich's results to our problem, some modifications are required. The mutual information does not have a bounded gradient (which is required by [32]), but this could be bypassed by keeping away from the boundary of $\Delta_{\mathcal{X}}$, i.e. from those points for which one of the elements of $Q$ is 0 or 1. One way to accomplish this is by mixing with the uniform prior when defining the target rate, using $\max_Q \sum_i I((1 - \lambda)Q + \lambda U, W_i)$ as a target, and then bounding the loss induced by this mixture. In the rateless scheme, a bound on the maximum value of $m_i F_i(Q)$ (of Lemma 4) is required and can be obtained using the same methods presented here.

Another application of sequential algorithms to solve problems related to AVCs was proposed by Buchbinder et al. [33], who used a sequential algorithm to solve a problem of dynamic transmit power allocation, where the current channel state is known but future states are arbitrary.

H. The combination of the communication scheme with the 
predictor 

In the communication scheme proposed in Section IV-B, we chose to use an i.i.d. prior during each block, and update the prior only at the end of the block. This choice is motivated by the following considerations:

• Assuming no explicit training symbols are transmitted, the estimation of the channel $W$ is done based on the encoded sequence, which is known to the receiver only after decoding (at the end of the block).

• Varying the prior throughout the block inserts memory 
into the channel input, which complicates the analysis. 

The result of this is a relatively slow update of the prior, 
essentially limited by the block size, which is determined 
based on communication related considerations (overheads 
and error probabilities). An alternative would be learning the
channel through random training symbols (see for example 
|2 |), and updating the prior from time to time, without relation 
to the rateless blocks. 

I. The behavior of the regret for binary channels

In Section III-C we have shown a lower bound on the
redundancy in attaining $C_2$ by using a counterexample with
$|\mathcal{X}| = 4$, $|\mathcal{Y}| = 2$. It is worth mentioning that for the set of
binary channels $|\mathcal{X}| = |\mathcal{Y}| = 2$, the normalized regret is not

necessarily *^(\/^)- °f channels, the optimal 

prior does not reach the boundaries of [0, 1]: the two input
probabilities $\Pr(X = x)$ are always in $[e^{-1}, 1 - e^{-1}]$ [19]. It is
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possible to show that the loss function 1{Q, W) — 1— ^^^'^"^ 
satisfies conditions 1,2,4 in Cesa-Bianchi and Lugosi's book
fTS I Theorem 3.1 (but not condition 3). This fact together with 
experimental results showing convergence of the FL predictor, 
suggests that the normalized minimax regret in this case may 
converge like O i 



J. The uniform component in prior predictor 

In the prediction scheme of Theorems 5]3 we mixed a 
uniform prior with an exponentially weighted predictor (35).
This mixing has two advantages:

1) It enables bounding the instantaneous regret caused by a large block due to a low mutual information.

2) It enables channel estimation by making sure all input symbols have a non-zero probability.

Note that alternative solutions are the use of training symbols at random locations, and termination and re-transmission of blocks whose length exceeds a threshold.

Mixing the exponentially weighted predictor with a uniform distribution is a technique used in prediction problems with partial monitoring, where the predictor only has access to its own loss (or a function of it) and not to the loss of the competitors [13, §6], and effectively assigns some time instances for sampling the range of strategies. In our problem the uniform prior plays two roles. The first is related to the rateless communication scheme, which requires relating the gains of the predictor to the gain of any alternative prior $Q$ (134) in order to have an upper bound on the latter (135). The second role is in the convergence of the estimated channel (Proposition 3). The second role is similar to the role of the uniform distribution in partial monitoring problems: the channel $W(y|x)$ cannot be estimated for input values $x$ that occur with zero probability.
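The effect of the uniform component on the smallest input probability can be illustrated with a short sketch. The helper name `mix_with_uniform` and the numbers are hypothetical; the only property used is the structure $Q_i = (1 - \lambda)Q' + \lambda U$ described in the text.

```python
def mix_with_uniform(q_prime, lam):
    """Return (1 - lam) * q_prime + lam * U, with U uniform on the alphabet."""
    n = len(q_prime)
    return [(1.0 - lam) * q + lam / n for q in q_prime]


if __name__ == "__main__":
    # A degenerate prior that never uses three of the four inputs.
    q_i = mix_with_uniform([1.0, 0.0, 0.0, 0.0], lam=0.2)
    print(q_i)       # every symbol now has probability at least lam / 4
    print(min(q_i))
```

Every input symbol receives probability at least $\lambda / |\mathcal{X}|$, so each row of the channel is sampled often enough to be estimated.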

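Note that even without the explicit uniform component $\lambda U$, the exponential weighting element $\int w_i(Q) Q \, dQ$ in (35) includes a small uniform component. In particular, referring to (36), since $1 \le e^{\eta \sum_{j < i} m_j F_j(Q)} \le e^{\eta n I_{\max}}$, we have $w_i(Q) \ge \frac{e^{-\eta n I_{\max}}}{\mathrm{vol}(\Delta_{\mathcal{X}})}$ and

$$\int w_i(Q) Q \, dQ \ge e^{-\eta n I_{\max}} \cdot \frac{1}{\mathrm{vol}(\Delta_{\mathcal{X}})} \int_{\Delta_{\mathcal{X}}} Q \, dQ = e^{-\eta n I_{\max}} \cdot U. \qquad (122)$$

However this value is too small for our purpose.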

K. Continuous channels 

In the current paper we assumed the input and output alphabets are finite. In general it is not possible to universally attain $C_2$ or $C_3$, even in the context of the synthetic problem of Section III, when the alphabet size $|\mathcal{X}|$ is infinite. This is because in the continuous case one is trying to assign a probability $Q$ to an infinite set of values, where the values producing the capacity may be a small subset. Consider the following example:

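Example 3. Let the channel $W_a$, with input $x$ and output $y$ ($x, y \in \mathbb{R}$), be defined by the arbitrary sequence $\{a_k\}_{k=1}^{\infty}$, $a_k \in \mathbb{R}$, with all $a_i \ne a_k$ ($i \ne k$). The channel rule is defined by:

$$y = \begin{cases} k & x = a_k \\ 0 & \text{o.w.} \end{cases} \qquad (123)$$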



For any sequential predictor (even randomized) we can find a sequence of channels $\{W_a\}$ such that the values of the sequence $\{a_k\}$ at each step have total probability zero (since the input distribution may have at most a countable set of discrete values with non-zero probability). Therefore we can always find a sequence of channels where the rate obtained by the predictor would be zero. On the other hand, each channel $W_a$ has infinite capacity (since it can noiselessly transmit any integer number). Therefore the value of $C_2$ is infinite (it is enough to choose a prior suitable for one of the channels in the sum (5)).

It stands to reason that under suitable continuity conditions on the channel and input constraints on $Q(x)$, we may convert the problem to a discrete one, while bounding the loss in this conversion, by discretization of the input, i.e. by selecting the input from a finite grid, or alternatively assuming a parametrization of the channel.

VII. Conclusion 

We considered the problem of adapting an input prior 
for communication over an unknown and arbitrarily varying 
channel, comprised of an arbitrary sequence of memoryless 
channels, using feedback from the receiver. We showed that it 
is possible to asymptotically approach the capacity of the time- 
averaged channel universally for every sequence of channels. 
This rate equals or exceeds the randomized AVC capacity of 
any memoryless channel with the same inputs, and thus the 
system is universal with respect to the AVC model. The result 
holds also when the channel sequence is determined adversarially. We also presented negative results showing which
communication rates or minimax regret convergence rates
cannot be attained universally (see a summary in Table I), and
presented a simplified synthetic problem relating to prediction 
of the communication prior, which may have applications for 
block-fading channels. 

When examining the role of feedback in combating unknown channels, previous works mainly focused on the gains
of rate adaptation, and here we have seen an additional 
aspect, namely selection of the communication prior, in which 
feedback improves the communication rate. The results have 
implications on competitive universality in communication, 
and suggest that with feedback, it would be possible for any 
memoryless AVC, to universally achieve a rate comparable to 
that of any finite block system, without knowing the channel 
sequence. 

When comparing the results to the traditional AVC results, the former setting was dominated by the notion of capacity, and thus, even when feedback was assumed, it was not used for adapting the communication rate. Here we have shown for the first time that rates equal to or better than the AVC capacity can be attained universally, when releasing the constraint of an a-priori guaranteed rate. This demonstrates the validity of the alternative "opportunistic" problem setting that has been considered in the last decade for feedback communication over unknown channels, a setting which does not focus on capacity.
Appendix

A. Proof of Lemma 2

Lemma 2 relates the exponential weighting of a bounded and concave real function $a \le F(x) \le b$ over a convex vector region $x \in S \subset \mathbb{R}^d$ to its maximum.

Proof: Let $x^*$ denote a global maximum of $F(x)$ in $S$ (which exists since $F$ is concave and $S$ is closed). Then from the concavity of $F$, for any $\lambda \in [0, 1]$ we have:

$$F(\lambda x + (1 - \lambda) x^*) \ge \lambda F(x) + (1 - \lambda) F(x^*) \ge \lambda a + (1 - \lambda) F(x^*). \qquad (124)$$

Note that the RHS is a constant. Denote $S_\lambda = \{\lambda x + (1 - \lambda) x^* : x \in S\} = \lambda S + (1 - \lambda) x^*$. Then due to convexity $S_\lambda \subset S$, and due to the shrinkage $\mathrm{vol}(S_\lambda) = \lambda^d \mathrm{vol}(S)$. Furthermore, by (124), $\forall x \in S_\lambda : F(x) \ge \lambda a + (1 - \lambda) F(x^*)$. We have:

$$\begin{aligned}
\int_S e^{\eta F(x)} dx &\ge \int_{S_\lambda} e^{\eta F(x)} dx \ge \int_{S_\lambda} e^{\eta (\lambda a + (1 - \lambda) F(x^*))} dx \\
&= e^{\eta (\lambda a + (1 - \lambda) F(x^*))} \mathrm{vol}(S_\lambda) \\
&= e^{\eta F(x^*)} \cdot e^{-\eta \lambda (F(x^*) - a)} \lambda^d \mathrm{vol}(S) \\
&\ge e^{\eta F(x^*)} \cdot e^{-\eta \lambda (b - a)} \lambda^d \mathrm{vol}(S).
\end{aligned} \qquad (125)$$

Therefore,

$$\frac{1}{\eta} \ln \left( \frac{1}{\mathrm{vol}(S)} \int_S e^{\eta F(x)} dx \right) \ge F(x^*) - \lambda (b - a) + \frac{d \ln \lambda}{\eta}. \qquad (126)$$

Maximizing the RHS with respect to $\lambda$ we obtain

$$\lambda^* = \frac{d}{\eta (b - a)}, \qquad (127)$$

where $\lambda^* \le 1$ by the assumptions of the lemma, and substituting $\lambda^*$ we have:

$$\frac{1}{\eta} \ln \left( \frac{1}{\mathrm{vol}(S)} \int_S e^{\eta F(x)} dx \right) \ge F(x^*) - \frac{d}{\eta} \left( 1 - \ln \frac{d}{\eta (b - a)} \right).$$

Rearranging yields the desired result:

$$\frac{1}{\eta} \ln \left( \frac{1}{\mathrm{vol}(S)} \int_S e^{\eta F(x)} dx \right) \ge F(x^*) - \frac{d}{\eta} \ln \left( \frac{\eta e (b - a)}{d} \right). \qquad (128)$$

□

B. Proof of Lemma 4

During the course of the derivation below we attempt to optimize the asymptotic form of the loss (up to constant factors), and thus we make simplifying assumptions on the parameters, which hold asymptotically for large enough $n$. For finite $n$ these assumptions might lead to suboptimal results. We do not discuss the assumptions during the course of the derivation and we collect them at the end. All integrals below are by default over the unit simplex $Q \in \Delta_{\mathcal{X}}$.

In the block-wise variation setting (Section III), our target was to control the growth rate of the regret. Here, at each block $i$, by (37) the gain of the competitor using prior $Q$ is $m_i F_i(Q)$ (bits), while the universal scheme sends a fixed number of bits $K$. Therefore the gain of the competitor $m_i F_i(Q)$ and the instantaneous regret $m_i F_i(Q) - K$ are related by a constant, and it is more convenient to base the derivation on the gain rather than the regret. The potential function $\Phi$ will be used as an approximation of the max in (37).

Denote the cumulative gain of the competitor with prior $Q$ as:

$$G_i(Q) \triangleq \sum_{j=1}^{i} m_j F_j(Q), \qquad (129)$$

and the potential function of $G_i(Q)$ as:

$$\Phi_i \triangleq \Phi(G_i(Q)). \qquad (130)$$

Note that $\Phi_i$ is not a function of $Q$ due to the integration over $Q$ performed by $\Phi(\cdot)$. We can now write $w_i(Q)$ as:

$$w_i(Q) = \frac{e^{\eta G_{i-1}(Q)}}{\Phi_{i-1}}. \qquad (131)$$

The growth of the potential is bounded by:

$$\begin{aligned}
\frac{\Phi_i}{\Phi_{i-1}} &= \int w_i e^{\eta m_i F_i(Q)} dQ \le \int w_i \left[ 1 + \eta m_i F_i(Q) + \eta^2 m_i^2 F_i(Q)^2 \right] dQ \\
&= 1 + \eta \int w_i m_i F_i \, dQ + \eta^2 \int w_i m_i^2 F_i^2 \, dQ,
\end{aligned} \qquad (132)$$

where in the last inequality we used Lemma 7 and assumed $\eta m_i F_i(Q) \le 1$. The dependence of $F_i$ and $w_i$ on $Q$ is suppressed for brevity. We now bound the integrals $\int w_i m_i F_i \, dQ$ and $\int w_i m_i^2 F_i^2 \, dQ$. The property that a badly chosen prior may cause the iterative system to get stuck (not transmitting any block) translates into the fact that without placing any limitations on $Q_i$, the competitor's gain $m_i F_i(Q)$ may be unbounded, since $m_i$ might be indefinitely large while $F_i(Q)$ can be any positive value. This is prevented by mixing with the uniform prior, which enables us to link $F_i(Q)$ with $F_i(Q_i)$. Since in the context of the lemma we do not assume $F_i(Q)$ is the mutual information, we use a bound which is slightly looser than Shulman and Feder's (45), but is based on the same technique [19], and only assumes concavity.

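Define $x + z$ as modulo-addition over the set $\mathcal{X}$, and write $U(x) = \frac{1}{|\mathcal{X}|} \sum_{z \in \mathcal{X}} Q(x + z)$, i.e. express the uniform prior as the mean of all cyclic rotations of $Q$. Using concavity and non-negativity of $F_i$:

$$F_i(U) \ge \frac{1}{|\mathcal{X}|} \sum_{z \in \mathcal{X}} F_i(Q(\cdot + z)) \ge \frac{1}{|\mathcal{X}|} F_i(Q). \qquad (133)$$

Because the prior (35) has the structure $Q_i = (1 - \lambda) Q' + \lambda U$, by the concavity of $F_i$:

$$\forall Q, i : \quad F_i(Q_i) \ge (1 - \lambda) F_i(Q') + \lambda F_i(U) \overset{(133)}{\ge} \frac{\lambda}{|\mathcal{X}|} F_i(Q). \qquad (134)$$

Using (134) in conjunction with (39) we have

$$m_i F_i(Q) \le \frac{|\mathcal{X}|}{\lambda} m_i F_i(Q_i) \le \frac{|\mathcal{X}|}{\lambda} K, \qquad (135)$$

which yields a bound on the competitor gain in each block.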
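We now bound the two integrals appearing in (132). Starting with the first integral, using the concavity of $F_i$:

$$\begin{aligned}
m_i F_i(Q_i) &= m_i F_i \left( (1 - \lambda) \int w_i(Q) Q \, dQ + \lambda U \right) \\
&\ge m_i (1 - \lambda) \int w_i(Q) F_i(Q) \, dQ + \lambda m_i F_i(U) \\
&\ge m_i (1 - \lambda) \int w_i(Q) F_i(Q) \, dQ,
\end{aligned} \qquad (136)$$

from which we obtain

$$\int w_i(Q) m_i F_i(Q) \, dQ \le \frac{m_i F_i(Q_i)}{1 - \lambda} \le \frac{K}{1 - \lambda}. \qquad (137)$$

The second order term is bounded as follows:

$$\begin{aligned}
\int w_i(Q) m_i^2 F_i^2(Q) \, dQ &= \int w_i(Q) (m_i F_i(Q)) (m_i F_i(Q)) \, dQ \\
&\overset{(135)}{\le} \frac{|\mathcal{X}|}{\lambda} K \cdot \int w_i(Q) m_i F_i(Q) \, dQ \\
&\overset{(137)}{\le} \frac{|\mathcal{X}|}{\lambda} \cdot K \cdot \frac{K}{1 - \lambda}.
\end{aligned} \qquad (138)$$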



Recall that in the classical weighted average predictor [13], the product of the instantaneous regret and the weighting function is guaranteed to be non-positive (Blackwell condition). Similarly, in the previous section we obtained $\int w(Q) r_i(Q) \, dQ \le 0$ (see (23)). In the present case, if we define $r_i(Q) = m_i F_i(Q) - K$, then by (137) we have $\int w(Q) r_i(Q) \, dQ \le \frac{K}{1 - \lambda} - K = \frac{\lambda}{1 - \lambda} K$, i.e. due to the inclusion of the uniform prior (which is needed for $m_i F_i$ to be bounded), this integral may be positive, although arbitrarily small. Thus, we pay a price in the first order term in order to be able to bound the second order term.

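Plugging the bounds (137), (138) into (132) we have:

$$\begin{aligned}
\frac{\Phi_i}{\Phi_{i-1}} &\le 1 + \eta \int w_i m_i F_i \, dQ + \eta^2 \int w_i m_i^2 F_i^2 \, dQ \\
&\le 1 + \frac{\eta K}{1 - \lambda} \left( 1 + \frac{\eta K |\mathcal{X}|}{\lambda} \right) \le \exp \left( \frac{\eta K}{1 - \lambda} \left( 1 + \frac{\eta K |\mathcal{X}|}{\lambda} \right) \right),
\end{aligned}$$

and therefore

$$\Phi_{B+1} \le \Phi_B \cdot \exp \left( \frac{\eta K}{1 - \lambda} \left( 1 + \frac{\eta K |\mathcal{X}|}{\lambda} \right) \right) \le \cdots \le \Phi_0 \exp \left( (B + 1) \frac{\eta K}{1 - \lambda} \left( 1 + \frac{\eta K |\mathcal{X}|}{\lambda} \right) \right). \qquad (139)$$

In the last step we applied the same relation inductively. Using (139) we can obtain a bound on $\Phi(G_{B+1}(Q))$, and we now use Lemma 2 to relate this bound to $\max_Q G_{B+1}(Q)$ and to the target rate $R_T = \frac{1}{n} \max_Q G_{B+1}(Q)$. The dimension is $d = \dim(\Delta_{\mathcal{X}}) = |\mathcal{X}| - 1$. By (37) we have

$$0 \le G_{B+1}(Q) \le n \cdot \max(R_T, I_{\max}). \qquad (140)$$

The reason for setting the upper bound as $b = n \cdot \max(R_T, I_{\max})$ rather than just $n \cdot R_T$ is technical, as this simplifies the conditions required to meet the requirements of the lemma. To satisfy $\eta (b - a) \ge d$ we only require $\eta \ge \frac{|\mathcal{X}| - 1}{n I_{\max}}$. By Lemma 2 and (139) we have:

$$\begin{aligned}
\max_Q G_{B+1}(Q) &\le \frac{1}{\eta} \ln \frac{\Phi(G_{B+1}(Q))}{\mathrm{vol}(\Delta_{\mathcal{X}})} + n \delta_1(R_T) = \frac{1}{\eta} \ln \frac{\Phi_{B+1}}{\Phi_0} + n \delta_1(R_T) \\
&\le \frac{K (B + 1)}{1 - \lambda} \left( 1 + \frac{\eta K |\mathcal{X}|}{\lambda} \right) + n \delta_1(R_T),
\end{aligned} \qquad (141)$$

where

$$\delta_1(R_T) \triangleq \frac{|\mathcal{X}| - 1}{n \eta} \ln \left( \frac{\eta e n \max(R_T, I_{\max})}{|\mathcal{X}| - 1} \right) \qquad (142)$$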

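is the redundancy term introduced by Lemma 2. Bounding $R_T$ using (141), while substituting $K (B + 1) = KB + K = nR + K$, we obtain:

$$R_T = \frac{1}{n} \max_Q G_{B+1}(Q) \le \left( R + \frac{K}{n} \right) \frac{1}{1 - \lambda} \left( 1 + \frac{\eta K |\mathcal{X}|}{\lambda} \right) + \delta_1(R_T). \qquad (143)$$

After rearrangement we have the following bound on $R$:

$$R \ge (R_T - \delta_1(R_T)) \cdot (1 - \delta_2) - \delta_3, \qquad (144)$$

where

$$\delta_2 \triangleq 1 - \left( 1 + \frac{\eta K |\mathcal{X}|}{\lambda} \right)^{-1} (1 - \lambda), \qquad (145)$$

$$\delta_3 \triangleq \frac{K}{n}. \qquad (146)$$

The rest of the proof of Lemma 4 is an algebraic derivation focused on simplifying and optimizing the bound above. The lower bound on $R$ in the RHS of (144) is increasing with respect to $R_T$. This is because $\frac{\partial}{\partial R_T} \delta_1$ is zero for $R_T < I_{\max}$, and for $R_T > I_{\max}$ the derivative $\frac{\partial}{\partial R_T} \delta_1$ is $\frac{|\mathcal{X}| - 1}{n \eta R_T}$, which by our assumption $\eta \ge \frac{|\mathcal{X}| - 1}{n I_{\max}}$ is smaller than 1. Therefore $\frac{\partial}{\partial R_T} (R_T - \delta_1(R_T)) \ge 0$. In order to optimize the parameters, we assume for now that $R_T \le I_{\max}$ and bound the difference $R - R_T$. Using $\frac{1}{1 + t} \ge 1 - t$ we have

$$1 - \delta_2 \ge \left( 1 - \frac{\eta K |\mathcal{X}|}{\lambda} \right) (1 - \lambda) \ge 1 - \frac{\eta K |\mathcal{X}|}{\lambda} - \lambda. \qquad (147)$$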
(147) 
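From (144), under the assumption $R_T \le I_{\max}$ we have:

$$\begin{aligned}
R &\ge (R_T - \delta_1(I_{\max})) \cdot (1 - \delta_2) - \delta_3 \\
&\ge R_T - \delta_1(I_{\max}) - \delta_2 \cdot I_{\max} - \delta_3 \\
&\overset{(147)}{\ge} R_T - \delta_1(I_{\max}) - \frac{\eta K |\mathcal{X}| I_{\max}}{\lambda} - \lambda \cdot I_{\max} - \delta_3.
\end{aligned} \qquad (148)$$

We further simplify $\delta_1(I_{\max})$ by making the assumption that $\eta \le \frac{|\mathcal{X}| - 1}{e I_{\max}}$, and therefore $\ln \left( \frac{\eta e I_{\max}}{|\mathcal{X}| - 1} \right) \le 0$, and $\delta_1(I_{\max}) \le \frac{|\mathcal{X}| - 1}{n \eta} \cdot \ln(n)$. Using these simplifications we further bound the RHS of (148) by $R_T - \Delta_{pred}$, where

$$\Delta_{pred} = I_{\max} \cdot \lambda + \eta \cdot \underbrace{\frac{c_0}{\lambda}}_{= a_1} + \underbrace{\frac{\ln(n)}{n} (|\mathcal{X}| - 1)}_{= b_1} \cdot \frac{1}{\eta} + \delta_3, \qquad (149)$$

and $c_0 \triangleq K \cdot |\mathcal{X}| \cdot I_{\max}$.

Applying Lemma 3 to the optimization of the two terms depending on $\eta$ in (149) (marked $a_1, b_1$, with powers $\alpha = 1, \beta = 1$) we have:

$$\eta^* = \sqrt{\frac{b_1}{a_1}} = \sqrt{\frac{\lambda (|\mathcal{X}| - 1) \ln(n)}{n c_0}}, \qquad (150)$$

and

$$\Delta_{pred} = I_{\max} \cdot \lambda + 2 \sqrt{a_1 b_1} + \delta_3 = I_{\max} \cdot \lambda + 2 \sqrt{\frac{c_0 (|\mathcal{X}| - 1) \cdot \ln(n)}{n \lambda}} + \delta_3. \qquad (151)$$

Substituting $c_0$ yields $\Delta_{pred}$ and $\eta$ stated in the Lemma.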
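The balancing step for $\eta$ can be checked numerically. The sketch below assumes the two $\eta$-dependent terms of (149) have the form $a_1 \eta + b_1 / \eta$ with $a_1 = c_0 / \lambda$ and $b_1 = (|\mathcal{X}| - 1) \ln(n) / n$; the function names and the parameter values ($n$, $K$, $\lambda$, $I_{\max}$) are hypothetical and chosen only for illustration.

```python
import math


def eta_terms(eta, a1, b1):
    """Sum of the two eta-dependent terms: a1 * eta + b1 / eta."""
    return a1 * eta + b1 / eta


def optimal_eta(a1, b1):
    """Minimizer of a1 * eta + b1 / eta over eta > 0."""
    return math.sqrt(b1 / a1)


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    n, K, nx, lam, i_max = 10 ** 6, 100, 4, 0.1, 2.0
    a1 = K * nx * i_max / lam            # c0 / lambda
    b1 = (nx - 1) * math.log(n) / n
    eta = optimal_eta(a1, b1)
    print(eta, eta_terms(eta, a1, b1), 2 * math.sqrt(a1 * b1))
```

At the minimizer the two terms are equal, which is why their sum collapses to $2 \sqrt{a_1 b_1}$.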



Now, the derivation involving equations (148)-(151) assumes $R_T \le I_{\max}$. Since the lower bound (144) on $R$ is increasing with respect to $R_T$, in the case that $R_T > I_{\max}$ we are guaranteed to obtain a better lower bound on $R$ than the lower bound $R \ge I_{\max} - \Delta_{pred}$ attained for $R_T = I_{\max}$ (in other words, the RHS of (144) for $R_T = I_{\max}$ is at least $I_{\max} - \Delta_{pred}$). Therefore the bound can be stated as $R \ge \min(R_T, I_{\max}) - \Delta_{pred}$.

We now collect the various assumptions we have made along the way. We use the same technique used in the proof of Theorem 1 of showing that if the assumptions do not hold then (possibly under some simple conditions) $\Delta_{pred} \ge I_{\max}$, and therefore the lemma holds in a void way (since the RHS of (40) becomes non-positive).

In (132) we assumed $\eta m_i F_i(Q) \le 1$. Using the upper bound of (135) we have the sufficient condition $\eta \frac{|\mathcal{X}|}{\lambda} K \le 1$. If this condition doesn't hold, i.e. $\eta \frac{|\mathcal{X}|}{\lambda} K > 1$, then the second term in (149) satisfies $\eta \frac{c_0}{\lambda} = \eta \frac{|\mathcal{X}| K}{\lambda} \cdot I_{\max} > I_{\max}$, so $\Delta_{pred} \ge I_{\max}$ and the lemma holds in a void way. Before (149) we assumed $\eta \le \frac{|\mathcal{X}| - 1}{e I_{\max}}$. When the opposite is true, the second term in (149) satisfies $\eta \frac{c_0}{\lambda} = \eta \frac{K |\mathcal{X}|}{\lambda} \cdot I_{\max} > \frac{(|\mathcal{X}| - 1) |\mathcal{X}| K}{e \lambda} \ge \frac{1}{2} K$. By requiring $K \ge 2 I_{\max}$ we have that in this case the lemma will also be true in a void way.

To use Lemma 2 we required $\eta \ge \frac{|\mathcal{X}| - 1}{n I_{\max}}$. If the opposite is true then the third term in (149) satisfies $\frac{\ln(n)}{n} (|\mathcal{X}| - 1) \frac{1}{\eta} > I_{\max} \ln(n)$, and thus if $n \ge e$, $\Delta_{pred} \ge I_{\max}$.

Therefore, by requiring $n \ge e$ and $K \ge 2 I_{\max}$, we have that if any of the assumptions we made does not hold, the lemma is true in a void way. This concludes the proof of Lemma 4.
□
□ 



C. Proof of Lemma 6

In this proof we use nats (logarithms are to the natural base). This does not change the results since all values scale according to the base of the logarithms. Also, we assume all probabilities and false probabilities are non-zero. It is easy to check that the results for zero probabilities follow by replacing zeros with small probabilities and taking the limit using $p \log p \to 0$.

Non-negativity: Define $p(y) = \sum_x Q(x) W(y|x)$ and write:

$$-I(Q, W) = \sum_{x,y} Q(x) W(y|x) \log \frac{p(y)}{W(y|x)} \overset{\log t \le t - 1}{\le} \sum_{x,y} Q(x) W(y|x) \left( \frac{p(y)}{W(y|x)} - 1 \right) = 1 - 1 = 0. \qquad (152)$$



Concavity with respect to $Q$: Denote as above $p(y) = \sum_x Q(x) W(y|x)$ and write:

$$I(Q, W) = \sum_{x,y} Q(x) W(y|x) \log \frac{W(y|x)}{p(y)} = \sum_{x,y} Q(x) W(y|x) \log W(y|x) - \sum_y p(y) \log p(y). \qquad (153)$$

The left hand term is linear with respect to $Q$. The function $t \log t$ is convex in $t$ (for all $t \ge 0$), and $p(y)$ is linear in $Q$, therefore the sum $\sum_y p(y) \log p(y)$ in the right hand term is convex in $Q$, and so $I$ is concave with respect to $Q$.
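The non-negativity and concavity properties can be spot-checked numerically on a small example. This is only a sanity check for proper (normalized) channels, not a substitute for the proof; the function name and the test values are hypothetical.

```python
import math


def mutual_information(q, w):
    """I(Q, W) in nats for a prior q[x] and a channel matrix w[x][y]."""
    ny = len(w[0])
    p = [sum(q[x] * w[x][y] for x in range(len(q))) for y in range(ny)]
    total = 0.0
    for x, qx in enumerate(q):
        for y in range(ny):
            if qx > 0 and w[x][y] > 0:
                total += qx * w[x][y] * math.log(w[x][y] / p[y])
    return total


if __name__ == "__main__":
    w = [[0.8, 0.2], [0.3, 0.7]]
    q1, q2 = [0.9, 0.1], [0.2, 0.8]
    for k in range(11):
        t = k / 10
        mix = [t * a + (1 - t) * b for a, b in zip(q1, q2)]
        lhs = mutual_information(mix, w)
        rhs = t * mutual_information(q1, w) + (1 - t) * mutual_information(q2, w)
        # Concavity: the value at the mixture is at least the mixture of values.
        print(round(t, 1), lhs >= rhs - 1e-12, lhs >= 0)
```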

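Convexity with respect to $W$: Let $\lambda_i \ge 0$, $\sum_i \lambda_i = 1$, and $W(y|x) = \sum_i \lambda_i W_i(y|x)$. We prove that $\Delta \triangleq I(Q, W) - \sum_i \lambda_i I(Q, W_i) \le 0$. Define the respective output distributions as $p_i(y) = \sum_x Q(x) W_i(y|x)$ and $p(y) = \sum_x Q(x) W(y|x) = \sum_i \lambda_i p_i(y)$, then we have

$$\begin{aligned}
\Delta &= \sum_{x,y} Q(x) W(y|x) \log \frac{W(y|x)}{p(y)} - \sum_{x,y,i} \lambda_i Q(x) W_i(y|x) \log \frac{W_i(y|x)}{p_i(y)} \\
&= \sum_{x,y,i} \lambda_i Q(x) W_i(y|x) \log \left( \frac{W(y|x) \cdot p_i(y)}{W_i(y|x) \cdot p(y)} \right) \\
&\le \sum_{x,y,i} \lambda_i Q(x) W_i(y|x) \left( \frac{W(y|x) \cdot p_i(y)}{W_i(y|x) \cdot p(y)} - 1 \right) \\
&= \sum_{x,y,i} \lambda_i Q(x) W(y|x) \frac{p_i(y)}{p(y)} - \sum_{x,y,i} \lambda_i Q(x) W_i(y|x) \\
&= \sum_{x,y} Q(x) W(y|x) - \sum_{x,y} Q(x) W(y|x) = 0.
\end{aligned} \qquad (154)$$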



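Boundedness: Since $\sum_{x'} Q(x') W(y|x') \ge W(y|x) Q(x)$ we have $\log \frac{W(y|x)}{\sum_{x'} Q(x') W(y|x')} \le \log \frac{1}{Q(x)}$. Now write:

$$\begin{aligned}
I(Q, W) &= \sum_{x,y} Q(x) W(y|x) \log \frac{W(y|x)}{\sum_{x'} Q(x') W(y|x')} \\
&\le \sum_{x,y} Q(x) W(y|x) \log \frac{1}{Q(x)} \\
&\le a \sum_x Q(x) \log \frac{1}{Q(x)} = a \cdot H(Q) \le a \cdot \log |\mathcal{X}|.
\end{aligned} \qquad (155)$$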



D. Proof of Lemma 8 and $L_p$ bounds on differences of entropies and capacities

In this section we prove Lemma 8, relating the $L_p$ norm difference of two channels (one of which may be a false distribution) to the difference in capacities. Two intermediate results that are captured in Lemmas 11 and 12 are an extension of the $L_1$ bound of Cover & Thomas to false distributions and a trivial extension of the same bound to $L_p$ norms.

We begin with the following $L_1$ bound on entropy from Cover & Thomas [23]:

Lemma 10 ($L_1$ bound on entropy, Theorem 17.3.3 of [23]). Let $Q, P$ be two distributions on the finite alphabet $\mathcal{Y}$ with $\|Q - P\|_1 \le \frac{1}{2}$, then

$$|H(Q) - H(P)| \le -\|Q - P\|_1 \cdot \log \frac{\|Q - P\|_1}{|\mathcal{Y}|}. \qquad (156)$$
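As a quick numeric illustration of the bound (156), the sketch below evaluates both sides for one pair of distributions. The helper names and the example distributions are hypothetical; natural logarithms are used on both sides.

```python
import math


def entropy(p):
    """Entropy in nats of a distribution given as a list of probabilities."""
    return -sum(x * math.log(x) for x in p if x > 0)


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def l1_entropy_bound(p, q):
    """Right-hand side of (156): -d * log(d / |Y|), with d the L1 distance."""
    d = l1_distance(p, q)
    return -d * math.log(d / len(p))


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.35, 0.25]
    print(abs(entropy(p) - entropy(q)), l1_entropy_bound(p, q))
```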



Also note that the function $-t \log \frac{t}{|\mathcal{Y}|}$ is monotonically non-decreasing for $t \le e^{-1} |\mathcal{Y}|$, as can be verified by differentiation. Our first step is to extend the lemma to a case where one of $P, Q$ is a false distribution. In Cover and Thomas' proof, the first step is to write entropy as $H(P) = \sum_y f(P(y))$ where $f(t) = -t \log t$ and to show that for all $0 \le v \le \frac{1}{2}$ and $0 \le t \le 1 - v$, the difference in $f$ is bounded by $|f(t + v) - f(t)| \le -v \log v$. Here $t$ represents the minimum of $P(y), Q(y)$ for a certain $y$, $v$ the absolute difference, and $t + v$ the maximum of $P(y), Q(y)$. Then, the difference in entropy is bounded by the sum of the absolute values, this bound is substituted in the summation, and convexity arguments are used to bring it to the desired form. The only step that needs to be modified is showing that $|f(t + v) - f(t)| \le -v \log v$, where now $t$ is no longer bounded to $t \le 1 - v$. It can be verified by differentiating the function $g(t) = f(t + v) - f(t)$ with respect to $t$ that the derivative is always negative for $v > 0$. In addition, $g(0) \ge 0$, therefore the maximum absolute value of this function, which is the absolute value of either the maximum or the minimum, occurs at either end of the region to which $t$ is limited. In the original proof this yields $|f(t + v) - f(t)| = |g(t)| \le \max(|g(0)|, |g(1 - v)|) = \max(f(v), f(1 - v)) = -v \log v$ (notice that $f(0) = f(1) = 0$). Here, since one of $P, Q$ is a legitimate distribution, $t \le 1$ (as the minimum of the two) and we have instead: $|f(t + v) - f(t)| = |g(t)| \le \max(|g(0)|, |g(1)|) = \max(f(v), -f(1 + v))$. As we will show
below, if we limit f < | we have f{v) > —/(I + v), and 
therefore the bound \ f(t + v) — f{t)\ — \g{t)\ < f{v) applies 
as in the original proof and Cover & Thomas' result holds. 
To show this, consider the function $g(v) = -v \ln v - (v + 1) \ln(v + 1)$. This function is 0 for $v = 0$, and the derivative is $g'(v) = -\ln v - 1 - \ln(v + 1) - 1 = -\ln(v (v + 1) e^2)$; it is positive in a certain interval $(0, v_1)$ and negative for $v > v_1$, and therefore it crosses 0 only once. Calculating this function
for V — ^ yields a positive value, therefore it is positive for all 
V < \- We capture this variation of Cover & Thomas result 
in the following lemma: 

Lemma 11 (Li bound on false entropy difference). Let P 

be a distribution on the finite alphabet y and P be a false 
distribution on the same alphabet, with \\P — P\\i < |, then 



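$$|H(P) - H(\tilde{P})| \le -\|P - \tilde{P}\|_1 \log \frac{\|P - \tilde{P}\|_1}{|\mathcal{Y}|}, \qquad (157)$$

where $\tilde{P}$ denotes the false distribution and the false entropy $H(\tilde{P})$ is defined as

$$H(\tilde{P}) \triangleq -\sum_y \tilde{P}(y) \log \tilde{P}(y). \qquad (158)$$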



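We first convert the bound to the $L_p$ norm ($p \ge 1$). To relate the norms we use Hölder's inequality: for two vectors $a, b$, $\sum_i |a_i b_i| \le \|a\|_p \cdot \|b\|_{\bar{p}}$, where $\bar{p}^{-1} = 1 - p^{-1}$ defines the Hölder conjugate $\bar{p}$ of $p$, and by convention for $p = \infty$ we define $1/p = 0$ (note that $p \ge 1$ and the conjugate of $p = \infty$ is $\bar{p} = 1$). Writing $\tilde{P}$ for the false distribution, we have

$$\begin{aligned}
\|P - \tilde{P}\|_1 &= \sum_y 1 \cdot |P(y) - \tilde{P}(y)| \le \|P - \tilde{P}\|_p \cdot \|1\|_{\bar{p}} \\
&= \|P - \tilde{P}\|_p \cdot \Big( \sum_{y \in \mathcal{Y}} 1^{\bar{p}} \Big)^{1/\bar{p}} = \|P - \tilde{P}\|_p \cdot |\mathcal{Y}|^{1/\bar{p}} \\
&= \|P - \tilde{P}\|_p \cdot |\mathcal{Y}|^{1 - 1/p}.
\end{aligned} \qquad (159)$$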
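The norm relation (159) holds for any real vector, which makes it easy to spot-check. In the sketch below the helper names and the sample vector are hypothetical.

```python
import math


def lp_norm(v, p):
    """L_p norm of a vector; p may be math.inf."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def l1_upper_bound(v, p):
    """Right-hand side of (159): ||v||_p * n^(1 - 1/p), with 1/inf taken as 0."""
    inv_p = 0.0 if p == math.inf else 1.0 / p
    return lp_norm(v, p) * len(v) ** (1.0 - inv_p)


if __name__ == "__main__":
    diff = [0.1, -0.05, 0.02, -0.07]  # e.g. a difference of two distributions
    for p in (1, 2, 3, math.inf):
        print(p, lp_norm(diff, 1), l1_upper_bound(diff, p))
```

The bound is tight when all entries have the same magnitude, which is the equality case of Hölder's inequality with the all-ones vector.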
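Assuming $\|P - \tilde{P}\|_p \cdot |\mathcal{Y}|^{1 - 1/p} \le e^{-1} |\mathcal{Y}|$ (with $\tilde{P}$ the false distribution) we can use the monotonicity of the bound of Lemma 11 and write:

$$\begin{aligned}
|H(P) - H(\tilde{P})| &\le -\|P - \tilde{P}\|_1 \log \frac{\|P - \tilde{P}\|_1}{|\mathcal{Y}|} \\
&\le -\|P - \tilde{P}\|_p \cdot |\mathcal{Y}|^{1 - 1/p} \log \frac{\|P - \tilde{P}\|_p}{|\mathcal{Y}|^{1/p}} = f_p(\|P - \tilde{P}\|_p),
\end{aligned} \qquad (160)$$

where we defined

$$f_p(t) \triangleq -t \cdot |\mathcal{Y}|^{1 - 1/p} \log \left( \frac{t}{|\mathcal{Y}|^{1/p}} \right). \qquad (161)$$

$f_p(t)$ is concave with respect to $t$ (because $-t \ln t$ is concave in $t \ge 0$), and is monotonically non-decreasing for $t \le e^{-1} |\mathcal{Y}|^{1/p}$, as can be verified by differentiation. Furthermore,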
to meet the requirement ||P — P||i < j of Lemma |TT| it is 
sufficient that \\P-P\\p ■ \y\'^^^/P < { (by ( fT59l )), and in ad- 
dition prior to ([T60]l we have assumed ||P — P||p < e~^|3^|^/P, 
however it is easy to see that this condition is dominated by the 
previous one. Since 1 — 1/p > 1, and |3^| > 1, it is sufficient 
to require ||P — P||p < |. In summary, we have the following 
result: 

Lemma 12 (Lp bound on false entropy difference). Let p > 1, 

P be a distribution on the finite alphabet y, and P be a false 
distribution on tlie same alphabet with ||P — P\\p < \, then: 



$$|H(P) - H(\tilde{P})| \le f_p(\|P - \tilde{P}\|_p), \qquad (162)$$



where fp is defined in and it is concave and monoton- 

ically non-decreasing for t < j^. 

We now write the false mutual information (51) as a difference of false entropies (158), denoting the false channel by $\tilde{W}$:

$$I(Q, \tilde{W}) = H \Big( \sum_x \tilde{W}(y|x) Q(x) \Big) - \sum_x Q(x) H(\tilde{W}(y|x)). \qquad (163)$$

The above is analogous to the equality $I(X; Y) = H(Y) - H(Y|X)$. For the channels $W, \tilde{W}$ define the difference as $\delta_{xy} = \tilde{W}(y|x) - W(y|x)$ and define the output distributions as $P_Y(y) = \sum_x W(y|x) Q(x)$ and $\tilde{P}_Y(y) = \sum_x \tilde{W}(y|x) Q(x)$, then by the triangle inequality:

$$|I(Q, \tilde{W}) - I(Q, W)| \le |H(\tilde{P}_Y) - H(P_Y)| + \sum_x Q(x) \left| H(\tilde{W}(y|x)) - H(W(y|x)) \right|. \qquad (164)$$



We begin with the difference H{Py) - H{Py). By the Lp 
bound of Lemma [12] we have: 



Using the triangle inequality: 

\\Py -Py\\v = \\Y. Q{x){W(y\x) ~ W{y\x))\\p 

X 

= \\^Q{x)5xy\\p 

X 

X 



where the notation is used to emphasize that the norm 

operation is with respect to y only. Using Holder's inequality. 



(166) 



1/p 



\Sxy\\p 



(167) 




^xy 1 1 p 



Assuming < |, fp is monotonously increasing, and 

combining the inequalities above we have: 

HiPY)~H{PY)<fp{\\Sxy\\p). (168) 
For the second part of ( |164| i, by the Lp bound we have: 

H{W{y\x)) - H{W{y\x))\ < fp{\\Sxy\\p,y). (169) 
Using the concavity and monotonicity of fp: 

Y,Q{^)\H{W{y\x))-H{W{y\x)) 

X 

^xy\\p,y) 



<fp\Y,Q{x)\\ 5xy\\p,y 



(170) 



fpiW^xyWp) 

where the monotonicity of fp is again guaranteed by the 
condition ||(5,v„IL„ < 
we have: 



xy\\p - i- Plugging ([T68ll and (fTTOji into ([T64]| 



i{Q,W)-I{Q,W) <2/p(|l<5,,||p) 



(171) 



which proves the bound on mutual information. The bound on capacity is trivially obtained from (171) above by writing $I(Q, W) \ge I(Q, \tilde{W}) - 2 f_p(\|\delta_{xy}\|_p)$, with $\tilde{W}$ denoting the false channel, and maximizing both sides with respect to $Q$ (and similarly for the other direction).

□ 

E. Proofs of small Lemmas 

Proof of Lemma 7: We would like to prove that $1 + x \le e^x \le 1 + x + x^2$. Using a finite Taylor series we have:



H{Py) - H{Py) < fpiWPY - PyWp). (165) 



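$$e^x = 1 + e^0 \cdot x + \frac{1}{2} e^t x^2, \qquad (172)$$

where $t \in [0, x] \cup [x, 0]$ is a point between 0 and $x$. This proves the lower bound. Also, for $x \le 0$, since $e^t \le 1$ this also proves the upper bound. For $0 < x \le 1$, the right inequality can be made tighter, by writing the full Taylor expansion:

$$\begin{aligned}
e^x &= \sum_{m=0}^{\infty} \frac{1}{m!} x^m = 1 + x + x^2 \sum_{m=2}^{\infty} \frac{1}{m!} x^{m-2} \\
&\le 1 + x + x^2 \sum_{m=2}^{\infty} \frac{1}{m!} = 1 + x + x^2 (e^1 - 1 - 1) \\
&= 1 + x + (e - 2) x^2 \le 1 + x + x^2.
\end{aligned} \qquad (173)$$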



□ 
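The two-sided bound $1 + x \le e^x \le 1 + x + x^2$ for $x \le 1$ can be spot-checked on a grid. The function name, grid range, and tolerance below are arbitrary choices for this sketch.

```python
import math


def exp_bounds_hold(x, tol=1e-12):
    """Check 1 + x <= exp(x) <= 1 + x + x^2 at a single point x <= 1."""
    value = math.exp(x)
    return (1 + x <= value + tol) and (value <= 1 + x + x * x + tol)


if __name__ == "__main__":
    grid = [-5 + 0.01 * k for k in range(601)]  # covers [-5, 1]
    print(all(exp_bounds_hold(x) for x in grid))
    # The upper bound is not claimed beyond x = 1 and fails for larger x.
    print(math.exp(2.0) <= 1 + 2.0 + 4.0)
```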



Proof of Lemma [ij f{t) is continuous and differentiable 
therefore $f'(t) = 0$ at the maximum. Differentiation yields $f'(t) = \alpha a \cdot t^{\alpha - 1} - \beta b \cdot t^{-\beta - 1}$, and $f'(t) = 0$ yields the single solution
is positive for t < t* and negative for t > t*. □ 

F. Proof of Theorem 2: the optimality of averaged channel capacity

In this section we prove Theorem 2 presented in §IV-A (regarding the optimality of $C(\overline{W})$). For a given sequence $W_1^n$, consider the "permutation" channel generated by uniformly selecting a random permutation $\pi$ of the indices $i = 1, \ldots, n$, rearranging the sequence $W_1^n$ to a permuted sequence $T_i = W_{\pi_i}$, and applying the channel $\Pr(\mathbf{Y} | \mathbf{X}, \pi) = \prod_i T_i(Y_i | X_i)$ to the input (i.e. using the channels $W_i$ in permuted order). Suppose there is a system achieving the rate $R(W_1^n) - \Delta$ with probability $1 - \delta$ and error probability $\epsilon$. Since this rate is fixed for all drawings of $\pi$, the system can guarantee the rate $R(W_1^n) - \Delta$ a-priori (with probability $1 - \delta$), and we can convert the rate-adaptive system to a fixed-rate system, delivering a message $m$ of $n (R(W_1^n) - \Delta)$ bits, with probability of error at most $\epsilon + \delta$. Once we constrain the discussion to the permutation channel induced by the deterministic sequence $W_1^n$, we can assume this sequence is known to the transmitter and the receiver.

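By a standard application of Fano's inequality [23, Theorem 2.10.1], we have:

$$I(m;\mathbf{Y}) = H(m) - H(m|\mathbf{Y}) \ge n(R(W_1^n) - \Delta)(1 - (\epsilon + \delta)) - h_b(\epsilon + \delta). \tag{174}$$

Rearranging and using $h_b(p) \le 1$ we have:

$$R(W_1^n) \le \frac{\frac{1}{n} I(m;\mathbf{Y}) + \frac{1}{n}}{1 - \epsilon - \delta} + \Delta. \tag{175}$$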

In the main part of the proof we will show that approximately, $\frac{1}{n} I(m;\mathbf{Y}) \le C(\overline{W})$. Note that because of feedback, $X_i$ may be a function of $m$ and $Y^{i-1}$, and therefore $I(X^n;Y^n)$ does not give a tight bound on the rate. As noted in the outline presented in Section IV-A, if the channels $T_i$ were selected from $W_1^n$ with replacement, this result would be obvious, since feedback would not be helpful. In the permuted channel, a system with feedback can use past channel outputs to gain some knowledge about the future behavior of the channel. The point of the proof is to show that there is no




[Figure: dependence graph with nodes including $X_i$, $Y_i$, $T_i$, $\Pi_i$]

Fig. 5. A dependence graph for the variables of the permutation channel in Appendix F. Each node is a (potentially random) function of the nodes with arrows pointing toward it.



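considerable gain from this knowledge, and even knowledge of the actual list of channels that were already picked does not change the mutual information considerably.

We denote by $\Pi$ the random permutation and by $\pi$ a specific instance of the permutation. We bound the mutual information as follows:

$$\begin{aligned} I(Y^n; m) &= \sum_{i=1}^n I(Y_i; m \,|\, Y^{i-1}) \\ &= \sum_{i=1}^n \left( H(Y_i|Y^{i-1}) - H(Y_i|Y^{i-1}, m) \right) \\ &\overset{(a)}{\le} \sum_{i=1}^n \left( H(Y_i) - H(Y_i|Y^{i-1}, m, \Pi^{i-1}, X_i) \right) \\ &\overset{(b)}{=} \sum_{i=1}^n \left( H(Y_i) - H(Y_i|\Pi^{i-1}, X_i) \right), \end{aligned} \tag{176}$$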



where (a) is because conditioning reduces entropy (used twice), and (b) is since $(Y^{i-1}, m) \leftrightarrow (\Pi^{i-1}, X_i) \leftrightarrow Y_i$ is a Markov chain (in other words, $(\Pi^{i-1}, X_i)$ gives all relevant information on $Y_i$). This can be seen from the functional dependence graph in Fig. 5. Let $Z_i$ be a random variable generated by passing $X_i$ through the channel $\overline{W}$ (i.e. $\Pr(Z_1^n|X_1^n) = \prod_{i=1}^n \overline{W}(Z_i|X_i)$). Next we show that $H(Y_i) \approx H(Z_i)$ and $H(Y_i|\Pi^{i-1},X_i) \approx H(Z_i|X_i)$.

Given $\Pi^{i-1}$, the channel law between $X_i$ and $Y_i$ is a random pick from the group of $n - i + 1$ channels that are not included in $\{\Pi_j\}_{j=1}^{i-1}$:

$$\begin{aligned} \Pr(Y_i = y \,|\, \Pi^{i-1}, X_i = x) &= \sum_{k=1}^n \Pr(Y_i = y \,|\, \Pi^{i-1}, \Pi_i = k, X_i = x) \cdot \Pr(\Pi_i = k \,|\, \Pi^{i-1}, X_i = x) \\ &= \frac{1}{n-i+1} \sum_{k \notin \{\Pi^{i-1}\}} W_k(y|x) \triangleq \overline{W}_{\Pi^{i-1}}(y|x). \end{aligned} \tag{177}$$

The average channel given the past indices, $\overline{W}_{\Pi^{i-1}}(y|x)$, is an average of $n - i + 1$ values $0 \le W_k(y|x) \le 1$. Note that the indices $k$ belong to $\Pi_i^n$, so the notation may be confusing, but it is used to stress the causal dependence on $\Pi^{i-1}$.
Considering the random variable $\overline{W}_{\Pi^{i-1}}(y|x)$ generated by calculating this channel over all drawings of $\Pi$, the set $k \notin \{\Pi^{i-1}\}$ becomes a random set of $n - i + 1$ distinct indices from $1, \ldots, n$, chosen uniformly from all such sets. $\overline{W}_{\Pi^{i-1}}(y|x)$ is an average of $n - i + 1$ values $0 \le W_k(y|x) \le 1$, sampled uniformly without replacement from the set $\{W_k(y|x)\}_{k=1}^n$ (for any specific $x, y$). It was shown by Hoeffding [17, §6] that averages of variables sampled without replacement obey the same bounds [17, Theorem 1] with respect to the probability to deviate from their mean, as independent random variables. Specifically, applying Hoeffding's bounds (combining Theorem 1 with Section 6 in [17]), and since $E[\overline{W}_{\Pi^{i-1}}(y|x)] = \overline{W}(y|x)$, we have:



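$$\Pr\{|\overline{W}_{\Pi^{i-1}}(y|x) - \overline{W}(y|x)| \ge t\} \le 2e^{-2(n-i+1)t^2}. \tag{178}$$

Using the union bound over all $|\mathcal{X}| \cdot |\mathcal{Y}|$ values of $x, y$ (see the proof of Proposition 3), we have:

$$\Pr\{\|\overline{W}_{\Pi^{i-1}} - \overline{W}\|_\infty \ge t\} \le 2|\mathcal{X}| \cdot |\mathcal{Y}| e^{-2(n-i+1)t^2}, \tag{179}$$

where the $L_\infty$ norm is over $x, y$. To further simplify, we pick a small value $\epsilon_0$, and from now on we assume $i \le (1 - \epsilon_0)n$. Substituting in (179), we have:

$$\Pr\{\|\overline{W}_{\Pi^{i-1}} - \overline{W}\|_\infty \ge t\} \le 2|\mathcal{X}| \cdot |\mathcal{Y}| e^{-2\epsilon_0 n t^2} \triangleq p. \tag{180}$$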
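The concentration of an average sampled without replacement can be illustrated by simulation. The sketch below is an assumption-laden toy, not the proof: the population of 200 values in $[0,1]$ stands in for $\{W_k(y|x)\}_{k=1}^n$ at one fixed pair $(x,y)$, and the names `deviation_frequency` and `hoeffding_bound` are hypothetical.

```python
import math
import random

def deviation_frequency(values, m, t, trials, seed=0):
    """Empirical Pr{|mean of m draws without replacement - population mean| >= t}."""
    rng = random.Random(seed)
    mean = sum(values) / len(values)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(values, m)  # uniform sampling without replacement
        if abs(sum(sample) / m - mean) >= t:
            hits += 1
    return hits / trials

def hoeffding_bound(m, t):
    """Two-sided Hoeffding bound 2*exp(-2*m*t^2) for values in [0, 1]."""
    return 2.0 * math.exp(-2.0 * m * t * t)

pop_rng = random.Random(1)
population = [pop_rng.random() for _ in range(200)]  # stand-ins for W_k(y|x)
m, t = 50, 0.15                                      # m plays the role of n - i + 1
freq = deviation_frequency(population, m, t, trials=2000)
bound = hoeffding_bound(m, t)
```

In such runs the empirical frequency typically sits far below the bound, which reflects that Hoeffding's inequality is loose for moderate deviations.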



Since $H(\cdot)$ is uniformly continuous (see Lemma 12), for any $\epsilon_0$ there is a $t$ such that if $\|P_1(y) - P_2(y)\|_\infty \le 2t$ then $|H(P_1) - H(P_2)| \le \epsilon_0$. For a given $\epsilon_0$ we choose the value of $t$ such that this requirement is satisfied, so that together with (180) we have:

$$\forall x: \Pr\{|H(\overline{W}_{\Pi^{i-1}}(\cdot|x)) - H(\overline{W}(\cdot|x))| \le \epsilon_0\} \ge 1 - p. \tag{181}$$

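We use the following relation to translate proximity in probability to proximity of the expected values: if $A, B \in [0, A_{\max}]$ are two random variables satisfying $\Pr\{|A - B| \le \epsilon\} \ge 1 - p$ (for some $\epsilon, p \in [0,1]$), then

$$\begin{aligned} |E[A] - E[B]| &= \left| E[(A - B) \cdot \mathrm{Ind}(|A - B| \le \epsilon)] + E[(A - B) \cdot \mathrm{Ind}(|A - B| > \epsilon)] \right| \\ &\le E[|A - B| \cdot \mathrm{Ind}(|A - B| \le \epsilon)] + E[|A - B| \cdot \mathrm{Ind}(|A - B| > \epsilon)] \\ &\le \epsilon + E[A_{\max} \cdot \mathrm{Ind}(|A - B| > \epsilon)] \\ &\le \epsilon + A_{\max} \cdot p. \end{aligned} \tag{182}$$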
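Relation (182) also holds for empirical averages, which makes it easy to check on synthetic data. The following sketch is illustrative only; the way the paired samples are generated (a close copy 90% of the time, an unrelated draw otherwise) and the function name are assumptions.

```python
import random

def expectation_gap_check(a, b, eps, a_max):
    """Return (|E[A] - E[B]|, eps + a_max * p) for paired samples in [0, a_max],
    where p is the empirical fraction of pairs with |A - B| > eps."""
    n = len(a)
    gap = abs(sum(a) / n - sum(b) / n)
    p = sum(1 for x, y in zip(a, b) if abs(x - y) > eps) / n
    return gap, eps + a_max * p

rng = random.Random(0)
a_max, eps = 1.0, 0.05
a = [rng.random() for _ in range(5000)]
# B tracks A within eps most of the time and is an unrelated draw otherwise.
b = [min(a_max, max(0.0, x + rng.uniform(-eps, eps))) if rng.random() < 0.9
     else rng.random() for x in a]
gap, bound = expectation_gap_check(a, b, eps, a_max)
```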

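Applying this inequality to bound $H(Y_i|\Pi^{i-1}, X_i)$ we have:

$$\begin{aligned} H(Y_i|\Pi^{i-1}, X_i) &= \sum_{\pi, x} H(Y_i \,|\, \Pi^{i-1} = \pi^{i-1}, X_i = x) \cdot \Pr(\Pi = \pi, X_i = x) \\ &= \sum_{\pi, x} H(\overline{W}_{\pi^{i-1}}(\cdot|x)) \cdot \Pr(\Pi = \pi, X_i = x) \\ &= E[H(\overline{W}_{\Pi^{i-1}}(\cdot|X_i))] \\ &\overset{(182)}{\ge} E[H(\overline{W}(\cdot|X_i))] - \epsilon_0 - \log|\mathcal{Y}| \cdot p \\ &= H(Z_i|X_i) - \epsilon_0 - \log|\mathcal{Y}| \cdot p. \end{aligned} \tag{183}$$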



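We now show that the distributions of $Y_i$ and $Z_i$ are similar (note that they are not equal, due to the possible dependence of $X_i$ on $\Pi^{i-1}$).

$$\begin{aligned} |\Pr(Y_i = y) - \Pr(Z_i = y)| &= |E[\Pr(Y_i = y \,|\, \Pi^{i-1}, X_i)] - E[\Pr(Z_i = y \,|\, X_i)]| \\ &= |E[\overline{W}_{\Pi^{i-1}}(y|X_i)] - E[\overline{W}(y|X_i)]| \le t + p. \end{aligned} \tag{184}$$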



Since for any $\epsilon_0, t$ we have $p \to 0$ as $n \to \infty$ (see (180)), we can choose $n$ large enough such that $p \le t$ and we have $|\Pr(Y_i = y) - \Pr(Z_i = y)| \le 2t$. Then, by our selection of $t$ (before (181)), we shall have:

$$|H(Y_i) - H(Z_i)| \le \epsilon_0. \tag{185}$$



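Returning to (176), and treating the first $(1 - \epsilon_0)n$ and the last $\epsilon_0 n$ symbols separately, we have:

$$\begin{aligned} I(Y^n; m) &\le \sum_{i=1}^n \left( H(Y_i) - H(Y_i|\Pi^{i-1}, X_i) \right) \\ &\le \sum_{i=1}^{(1-\epsilon_0)n} \left[ (H(Z_i) + \epsilon_0) - (H(Z_i|X_i) - \epsilon_0 - \log|\mathcal{Y}| \cdot p) \right] + \epsilon_0 \cdot n \cdot \log|\mathcal{Y}| \\ &\le \sum_{i=1}^n I(Z_i; X_i) + n\left( 2\epsilon_0 + (\epsilon_0 + p) \cdot \log|\mathcal{Y}| \right) \\ &\le n \cdot C(\overline{W}) + n\delta_0, \end{aligned} \tag{186}$$

where $\delta_0 \triangleq 2\epsilon_0 + (\epsilon_0 + p) \cdot \log|\mathcal{Y}|$.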



Because $\epsilon_0$ is a parameter of choice, and for any $\epsilon_0, t$ we have $p \to 0$ as $n \to \infty$ (see (180)), we can make $\delta_0$ as small as desired for $n$ large enough. Returning to (175) we have:



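$$\begin{aligned} R(W_1^n) &\le \frac{C(\overline{W}) + \delta_0 + 1/n}{1 - \epsilon - \delta} + \Delta \\ &\le (C(\overline{W}) + \delta_0 + 1/n)(1 + \epsilon + \delta) + \Delta \\ &\le C(\overline{W}) + \underbrace{(\delta_0 + 1/n)(1 + \epsilon + \delta) + (\epsilon + \delta) I_{\max} + \Delta}_{\delta_1}, \end{aligned} \tag{187}$$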



where $I_{\max}$ is defined in (19). Since by Definition 1 the above must hold for every $\epsilon, \delta, \Delta$, for $n$ large enough, and $\delta_0 \to 0$ as $n \to \infty$ (see (186) and the discussion following it), we can make $\delta_1$ as small as desired by taking $n \to \infty$. This concludes the proof of Theorem 2. □



G. Proof of Lemma 5

Clearly, using the assumptions of this section, $F_i(Q) = I(Q, W_i)$ satisfies the conditions of the lemma. The lemma
assumes there are S + 1 blocks and the rate is which 
corresponds to a case where the last block was not decoded, 
however it holds as a lower bound even if the last block was 
decoded. We now optimize the value of $\Delta$. Starting from (41):



a;., (A) - - + /,„ax • A + cn/ A-i (188) 



we determine $\Delta$ using Lemma 3 (with $\alpha = 1$, $\beta = \frac{1}{2}$) and obtain:



a:„,(a*) 




3{KX\{\X\-l))hL^ 



K\ 3 



< 



(a) fX\ 3 

< ; 

n 



= 4- if 3 -1X13 -li,^ 



3(1^-1 -/.nax) 3 lni(7i) + 

i-i\X\-I„,,^)hJ{n) 
2 On(7i) 



if 



(189) 



where in (a) we assumed $K \le |\mathcal{X}| \cdot n \cdot I_{\max}$. If the contrary is true, the first term in (188) yields $\Delta_{\mathrm{pred}} > I_{\max}$ and the theorem holds vacuously. Similarly, we do not have to worry about the case $\Delta^* > 1$ since also in this case, due to the second term in (188), $\Delta_{\mathrm{pred}} > I_{\max}$.

If the conditions of Lemma 4 are satisfied, we have for all $Q$:



V 1=1 

B + 1 



R > min ( y ^ . I{Q, W,) - Ap,„„ i„ 



> 



Ap„ 



(190) 



/(g,w^)-Ap,.„ 



where we used the convexity of $I(Q, W)$ with respect to the channel $W$. Maximizing both sides of (190) with respect to $Q$ yields the desired result (47).

The conditions of Lemma 4 on $n, K$ remain as conditions of the theorem. The application of Lemma 3 in (189) yields the following value of $\Delta$:



A* 



K-\X\i\X\-l)-I, 
This concludes the proof of Lemma 5.




(191) 



□ 



Fig. 6. An illustration of the generation of the channels $W_s^r$ in Example 4: $r \in \{0,1,2\}$ is chosen once, $s \in \{0,1\}$ is chosen i.i.d., and the input $X$ passes through $W_s^r(Y|X)$.

H. Channel knowledge compared to channel estimation

In this section we demonstrate the claim made in Section IV-A that even imposing on the synthetic problem only the limitation that the past channels are not given, but need to be estimated, leads to the conclusion that $C_2$ is not attainable. To show this we use an example, based on randomization of the channel sequence. As in the synthetic problem, we assume $I(Q_i, W_i)$ bits are transmitted in time instance $i$ (in other words, this is the gain obtained in retrospect for choosing $Q_i$); however, instead of knowing the full channel sequence, the predictor is only allowed to base its decisions on measurements of the channel input and output, i.e. on the values of $(Y_1^{i-1}, X_1^{i-1})$, where $Y_i$ is the result of $W_i$ operated on $X_i$. It would make sense to also require that $X_i$ be distributed $Q_i(x)$, but this assumption is not required for the counterexample.

Example 4. Consider a ternary-input, binary-output channel. We will choose the channel randomly, and consider the average gain of the predictor and the reference (since the average regret is a lower bound for the maximum regret). The basic channels are $W_1$ and $W_0$. Note that in the two channels, the first input is useless, and using only the two last inputs yields a rate of 1 bit/use. We add to this family of channels all 3 possible cyclic rotations of the inputs, and term the channel $W_s^r$ ($s = 1,2$; $r = 1,2,3$). The resulting channels are depicted in Fig. 6. Now we generate the sequence of channels as follows: choose $r$ randomly (once for the entire sequence), and choose a random (uniform, i.i.d.) sequence $s_i$. The competitor, knowing $r$, easily selects a prior that optimizes $\sum_i I(Q, W_i)$, since $W_1^r$ and $W_2^r$ have the same optimizer for each $r$, and achieves a rate of 1. Because of the random generation of the sequence $s_i$, for any value of $r$, the channel output $Y_1^n$ is uniform i.i.d. over $\{0,1\}$ and independent of the input. Therefore the predictor cannot infer any information on $r$ from the input-output distribution. Therefore the best the predictor can do (in terms of optimizing for the worst-case $r$) is place a uniform prior over all 3 inputs, and therefore obtain a rate of $\frac{2}{3}$, i.e. a regret of $\frac{1}{3}$ bit per channel use. By increasing the size of the channel input, this gap can be increased indefinitely.
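The rates in Example 4 can be reproduced with a small computation. The transition matrices below are an assumption: they are one pair of channels consistent with the description (first input useless, the other two inputs noiseless, and the two channels being complements of each other), indexed as $s \in \{0,1\}$, $r \in \{0,1,2\}$ as in Fig. 6. The helper names are hypothetical.

```python
import math

def mutual_information(prior, channel):
    """I(Q, W) in bits; channel[x][y] = W(y|x)."""
    ny = len(channel[0])
    p_y = [sum(prior[x] * channel[x][y] for x in range(len(prior))) for y in range(ny)]
    total = 0.0
    for x, qx in enumerate(prior):
        for y in range(ny):
            w = channel[x][y]
            if qx > 0 and w > 0:
                total += qx * w * math.log2(w / p_y[y])
    return total

# Assumed basic channels (rows = inputs, columns = outputs); input 0 is useless.
BASE = {
    0: [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]],
    1: [[0.5, 0.5], [0.0, 1.0], [1.0, 0.0]],
}

def rotated(s, r):
    """Channel W_s^r: the inputs of BASE[s] cyclically rotated by r."""
    return [BASE[s][(x - r) % 3] for x in range(3)]

def average_rate(prior, r):
    """Average of I(Q, W_s^r) over the uniform choice of s."""
    return 0.5 * sum(mutual_information(prior, rotated(s, r)) for s in (0, 1))

uniform = [1 / 3, 1 / 3, 1 / 3]
predictor_rates = [average_rate(uniform, r) for r in range(3)]  # predictor, any r
competitor_rate = average_rate([0.0, 0.5, 0.5], 0)              # competitor knows r = 0
```

Under these assumed matrices, averaging the two channels over $s$ gives an output that is uniform regardless of the input, which is the reason the predictor learns nothing about $r$.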

The conclusion from the example is that $C_2$ cannot be attained universally when actual channel measurements are used.



I. An analysis of the prior quantization approach



In Section VI-G we mentioned an alternative of using a "codebook" of priors, instead of the exponential weighting scheme over the continuum of priors, which was used in this paper. Following is a rough analysis of this approach, for the block-wise variation setting. We first determine the accuracy required of the codebook. Suppose we have two priors $Q_1, Q_2$ with $\|Q_1 - Q_2\|_\infty \le \Delta$, and for a certain channel $W$ the resulting output distributions are $P_1, P_2$ respectively ($P_m = \sum_x Q_m(x) W(y|x)$, $m = 1,2$). We write $I(Q_m, W) = H(P_m) - \sum_x Q_m(x) H(W(\cdot|x))$ (output entropy minus output entropy given the input). Since by definition $\|P_1 - P_2\|_\infty \le |\mathcal{X}| \cdot \|Q_1 - Q_2\|_\infty$, by using Lemma 11 we have $|H(P_1) - H(P_2)| \le f_\infty(|\mathcal{X}| \cdot \Delta)$, and since the second term in $I(Q_m, W)$ may change by at most $\log|\mathcal{X}| \cdot \Delta$, we have $|I(Q_1, W) - I(Q_2, W)| \le f_\infty(|\mathcal{X}| \cdot \Delta) + \log|\mathcal{X}| \cdot \Delta \triangleq \Delta_I$.
Therefore 

quantization to A/ = 

1 

0{n 2 ) (here, Qi represents the any prior, and Q2 represents 
the closest point in the codebook). To have a density of 



in order to bound the loss due to the codebook 
1 

0(121^)2 we need to have A = 



(l^l-i) 



0{n 2) per dimension, JM — (J\^^'-'''' "'J points are 

required. Now, since $\max_Q \frac{1}{n}\sum_{i=1}^n I(Q, W_i)$ differs from $\max_{m=1,\ldots,N} \frac{1}{n}\sum_{i=1}^n I(Q_m, W_i)$ by at most $\Delta_I$, we can now consider the problem of competing against the $N$ priors (considered as $N$ experts). The best normalized redundancy that can be attained is $O\left(\sqrt{\frac{\ln N}{n}}\right) = O\left(\sqrt{\frac{\ln n}{n}}\right)$ (see the lower bound [13, Theorem 3.7] and the upper bound [13, Corollary 2.2] in Cesa-Bianchi and Lugosi's book). Note that since the predictor loss and the codebook loss are balanced, we cannot gain by changing the codebook density. However, we have not shown that the bound on $\Delta_I$ is tight.

J. Operation with any positive feedback rate

Here we show how the scheme can be modified to operate with any positive feedback rate. Feedback is used in the scheme of Section IV-B for two purposes:

1) In order to report reception of a rateless block (we use 1 bit per channel use).

2) In order to send the estimated averaged channel $\hat{W}_i$ after the end of each block (or alternatively, the next prior $Q_{i+1}$).

Suppose feedback is limited to rate $R_{\mathrm{FB}}$. Instead of reporting successful reception on each symbol, we report it every $N_1 = \lceil 1/R_{\mathrm{FB}} \rceil$ symbols. The price would be wasting up to $N_1$ symbols per block, which essentially form an unused "gap" between successful decoding of block $i$ and the start of block $i + 1$.

We now give a coarse bound on the number of bits required to represent the estimated averaged channel $\hat{W}_i$. $\hat{W}_i$ is completely specified to the transmitter by specifying the empirical distribution $\hat{P}_{XY}(x, y)$, which takes at most $(m + 1)^{|\mathcal{X}||\mathcal{Y}|}$ values for a block of length $m$. Since $m \le n$, the number of bits is at most $N_2 = |\mathcal{X}| \cdot |\mathcal{Y}| \cdot \log(n + 1) = O(\ln n)$.



These bits can be sent over $N_2 / R_{\mathrm{FB}}$ channel uses at the end of each block, thus forming another unused "gap" between the blocks. Overall the gap between blocks is $N_1 + N_2 / R_{\mathrm{FB}} = O(\ln n)$. Since the maximum number of blocks grows sub-linearly in $n$, the overall loss can be made negligible.

Specifically, the effect of the additional gap on the rate can be analyzed using the same technique used to analyze the loss in the last symbol (the transition between (71) and (74)), and
would effectively increase the term log in '^1 < |66l ) by 

a factor of the gap $O(\log n)$. Since $K \in \omega(\log n)$ it is easy to see that under the same setting of the parameters of the scheme, we would still have $\delta_1 \to 0$ and $\Delta_C \to 0$, and at nearly the same convergence rate.

A delay in the feedback link would simply mean that an additional fixed gap will be added between the blocks, which also does not prevent asymptotic convergence.



K. Generation of the prior using rejection sampling 

As mentioned, implementation of the prediction methods described in this paper, which are based on a weighted average over the unit simplex, requires the calculation of integrals. Below, we show an alternative method to generate the same results, based on rejection sampling. Instead of explicitly calculating the predictor $\hat{Q}$, we describe an algorithm that generates a random variable $X \sim \hat{Q}$ (which can be used to generate a letter in the random codebook), based on multiple drawings of uniform random variables. The number of random drawings required in this algorithm is polynomial in $n$, but still prohibitively large, so unfortunately it is not practical.

First, any scalar random variable can be derived from a uniform $[0,1]$ random variable by the inverse transform theorem. Generating a mixture of an exponentially weighted and a uniform distribution, such as in (35), only requires tossing a coin with probability $\lambda$, which determines whether $X$ is generated using the exponentially weighted distribution or using a uniform distribution. Therefore the problem of generating the predictors described here, (16) and (35), boils down to the following problem: we would like to generate a random variable $X$ distributed according to

$$\hat{Q} = \int_\Delta w(Q)\, Q\, dQ, \tag{192}$$

where

$$w(Q) = \frac{e^{\eta g(Q)}}{\int_\Delta e^{\eta g(Q')}\, dQ'}, \tag{193}$$

and where $g(Q)$ is a concave function and is bounded $0 \le g(Q) \le n \cdot g_0$. $\Delta$ is the unit simplex (which implicitly refers to the alphabet $\mathcal{X}$). All integrals below are over the unit simplex. Furthermore, we would like to accomplish this without computing any integrals.

The first observation is that instead of generating an $X$ from $\hat{Q}$ it is enough to generate the probability vector $Q$ randomly with the probability distribution $w(Q)$ and then generate an $X$ from the (specific) probability distribution $Q$. The last step can be accomplished using the inverse transform theorem. In this case we have:

$$\Pr(X = x) = E[\Pr(X = x | Q)] = E[Q(x)] = \int_\Delta Q(x)\, w(Q)\, dQ. \tag{194}$$



This leaves us with the problem of generating $Q \sim w(Q)$. This is accomplished by rejection sampling, i.e. we first generate a random variable with a different distribution, and if it does not satisfy a given condition, we "reject it" and re-generate it, until the condition is satisfied.

We first generate a probability distribution $P$ uniformly over the unit simplex $\Delta$. There are several algorithms for uniform sampling over the unit simplex [34]. A simple algorithm, for example, is normalizing a vector of i.i.d. exponential random variables. Define $G(Q) = e^{\eta g(Q)}$, and $a(Q) = \alpha G(Q)$. We will determine $\alpha$ later on such that $\forall Q: \alpha \cdot G(Q) \le 1$. Having generated $P$, we toss a coin with probability $a(P)$ for "accept". If $P$ is accepted, this is the resulting random variable and we set $Q = P$. Otherwise, we draw $P$ again and repeat the process. Let $A$ denote the event of acceptance, and $f_P$ denote the distribution of $P$, which is the uniform distribution over the simplex. The distribution of $Q$ equals the distribution of $P$ given that it was accepted, i.e.:

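$$\begin{aligned} f_Q(q) = f_{P|A}(q) &= \frac{\Pr\{A | P = q\} \cdot f_P(q)}{\Pr\{A\}} = \frac{\Pr\{A | P = q\} \cdot f_P(q)}{\int \Pr\{A | P = q\} \cdot f_P(q)\, dq} \\ &= \frac{a(q) \frac{1}{\mathrm{vol}(\Delta)}}{\int a(q) \frac{1}{\mathrm{vol}(\Delta)}\, dq} = \frac{G(q)}{\int G(q)\, dq} = \frac{e^{\eta g(q)}}{\int e^{\eta g(q)}\, dq} = w(q), \end{aligned} \tag{195}$$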



which is the desired distribution. 

To determine $\alpha$, suppose we know the maximum of $g(Q)$. This is usually possible since it is a convex optimization problem. Even if this value is not known, a bound on this value will be sufficient. Suppose that $Q^*$ is the maximizer of $g(Q)$ and therefore also of $G(Q)$. Then it is enough to set $\alpha = \frac{1}{G(Q^*)} = e^{-\eta g(Q^*)}$.


1) Compute the maximum of $g(Q)$ (a convex optimization problem), or a bound on it.

2) Set $\alpha \le e^{-\eta g(Q^*)}$.

3) Draw $Q$ uniformly over the unit simplex [34].

4) Toss a coin and with probability $1 - \alpha e^{\eta g(Q)}$ return to step 3.

5) Draw $X$ randomly according to the distribution $Q(x)$.

TABLE II
An algorithm to generate $X \sim \hat{Q}$, (192), (193)
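The steps of Table II can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the toy concave function `g`, its bound `g_max`, and the parameter values are all hypothetical, and the uniform simplex draw uses the normalized-exponentials method mentioned in the text.

```python
import math
import random

def uniform_simplex(k, rng):
    """Uniform point on the unit simplex: normalized i.i.d. exponentials."""
    e = [rng.expovariate(1.0) for _ in range(k)]
    s = sum(e)
    return [v / s for v in e]

def sample_letter(g, g_max, eta, k, rng):
    """Draw X by rejection sampling; returns (x, number_of_draws).
    g is a concave function on the simplex and g_max an upper bound on its maximum."""
    draws = 0
    while True:
        q = uniform_simplex(k, rng)                  # step 3
        draws += 1
        accept = math.exp(eta * (g(q) - g_max))      # alpha * G(q), alpha = exp(-eta * g_max)
        if rng.random() < accept:                    # step 4: otherwise redraw
            break
    u, acc = rng.random(), 0.0                       # step 5: inverse transform on q
    for x, qx in enumerate(q):
        acc += qx
        if u < acc:
            return x, draws
    return k - 1, draws                              # guard against rounding in the sum

rng = random.Random(0)
g = lambda q: 20.0 * q[0]                            # toy linear (hence concave) g, max = 20
samples = [sample_letter(g, g_max=20.0, eta=0.2, k=3, rng=rng) for _ in range(2000)]
freq0 = sum(1 for x, _ in samples if x == 0) / len(samples)
mean_draws = sum(d for _, d in samples) / len(samples)
```

With this toy `g`, the weighting favors priors that put mass on letter 0, so `freq0` comes out well above the uniform value of 1/3, and `mean_draws` gives an empirical feel for the number of iterations analyzed next.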



An important question from an implementation perspective is the average number of iterations required. Since the probability of acceptance $\Pr\{A\}$ in each iteration is fixed, the number of iterations is a geometric random variable, with mean $\overline{N} = \frac{1}{\Pr\{A\}}$. By Lemma 2 we can relate $G(Q^*)$ to $E[G(Q)]$ and bound the average number of iterations. Using the lemma we have:

$$\begin{aligned} g(Q^*) &\le \frac{1}{\eta} \ln\left( \int_\Delta e^{\eta g(Q)} \frac{dQ}{\mathrm{vol}(\Delta)} \right) + \frac{d}{\eta} \ln\left( \frac{\eta e n g_0}{d} \right) \\ &= \frac{1}{\eta} \ln\left( E[G(P)] \right) + \frac{d}{\eta} \ln\left( \frac{\eta e n g_0}{d} \right), \end{aligned} \tag{196}$$

where $d = |\mathcal{X}| - 1$ is the dimension of the unit simplex. We obtain the following bound on $\alpha$:

$$\alpha = e^{-\eta g(Q^*)} \ge \frac{1}{E[G(P)]} \cdot \left( \frac{\eta e n g_0}{d} \right)^{-d}, \tag{197}$$

and the average number of iterations can be bounded:

$$\overline{N} = \frac{1}{\Pr\{A\}} = \frac{1}{E[\Pr\{A|P\}]} = \frac{1}{E[a(P)]} = \frac{1}{\alpha E[G(P)]} \le \left( \frac{\eta e n g_0}{d} \right)^{d}. \tag{198}$$

Since $\eta$ is polynomial in $n$ and tends to 0, $\overline{N}$ grows slower than $n^d$; however, this number is still prohibitively large. The algorithm described is summarized in Table II.

L. Why "follow the leader" fails

As noted in Section III-B, the relation of the synthetic prediction problem to prediction under the absolute loss function implies that the FL predictor cannot be applied to our problem. Here we give a specific example to see why FL fails, based on the channel defined in Section III-B. We construct the following sequence of channels: the channel at $i = 1$ is a mixture of $W_0$ with probability $\frac{1}{2}$ and a completely noisy channel $Y = \mathrm{Ber}(\frac{1}{2})$. For this channel $I(Q, W) = \frac{1}{2} I(Q, W_0)$. At time $i = 2$, the best a-posteriori strategy is $q = 0$. The sequence of channels from time $i = 2$ onward is the alternating sequence $(W_1, W_0, W_1, W_0, \ldots)$. It is easy to see that the resulting cumulative rates are linear functions of $q$ and thus the optimum is attained at the boundaries of $[0,1]$ and $q_i = (0, 1, 0, 1, \ldots)$. At each time, since the channel that slightly dominates the past is opposite of the channel that is about to appear, the FL predictor chooses the prior that yields the least mutual information, and ends up having a zero rate in time instances $i = 2, \ldots, n$. On the other hand, by using a uniform fixed prior, a competitor may achieve an average rate of $\frac{1}{2}$ over these symbols. Therefore the normalized regret of FL would be at least $\frac{1}{2}$, and does not vanish asymptotically.

The problem with the FL predictor is that it takes a decision based on a slight inclination of the cumulative rate toward one of the extremes.

Note that for $|\mathcal{X}| = 4$, $|\mathcal{Y}| = 2$, $I(Q, W)$ does not satisfy the Lipschitz condition required in [35, Theorem 1] for this strategy to work.
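The failure mode can be reproduced with a toy model in which the per-symbol rate is exactly linear in $q$. The sketch below is an assumption-based abstraction rather than the channel of Section III-B: it takes the rate of $W_1$ to be $q$, the rate of $W_0$ to be $1 - q$, and the first (half-strength) channel to give $(1 - q)/2$. The function names and the arbitrary first move of FL are hypothetical.

```python
def follow_the_leader(gains):
    """gains: list of (slope, intercept); gain_i(q) = slope * q + intercept on [0, 1].
    FL plays the q that maximizes the cumulative past gain (an endpoint, by linearity)."""
    cum_slope, total = 0.0, 0.0
    for i, (slope, intercept) in enumerate(gains):
        if i == 0:
            q = 0.5                                # arbitrary first move (assumption)
        else:
            q = 1.0 if cum_slope > 0 else 0.0
        total += slope * q + intercept
        cum_slope += slope
    return total

def fixed_prior(gains, q):
    """Total gain of a competitor that keeps the same q throughout."""
    return sum(slope * q + intercept for slope, intercept in gains)

n = 101
W0, W1 = (-1.0, 1.0), (1.0, 0.0)                   # assumed rates: 1 - q and q
first = (-0.5, 0.5)                                # half-strength W0: (1 - q) / 2
gains = [first] + [W1 if k % 2 == 0 else W0 for k in range(n - 1)]

fl_total = follow_the_leader(gains)
uniform_total = fixed_prior(gains, 0.5)
normalized_regret = (uniform_total - fl_total) / n
```

In this toy model FL earns nothing from $i = 2$ onward, while the fixed choice $q = \frac{1}{2}$ earns half a bit per symbol, so the normalized regret stays close to $\frac{1}{2}$ as $n$ grows.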